ABSTRACT


Experiments were performed to determine the feasibility of using
ALCAPP  as  one  form  of  on-line  dialogue. 

Assuming  the  ALCAPP  (Automatic  List  Classification  and  Profile 
Production)  system  is  in  an  on-line  mode,  investigations  of  those 
parameters  which  could  affect  its  stability  and  reliability  were 
conducted.  Fifty-two  full  text  documents  were  used  to  test  how  type 
of  indexing,  depth  of  indexing,  the  classification  algorithm,  the  order 
of document presentation, and the homogeneity of the document collection
would affect the hierarchical grouping programs of ALCAPP. Six
hundred abstracts were used to study the effect on document clusters
when  more  documents  are  added  to  the  data  base  and  the  effect  on  the 
final  cluster  arrangement  when  the  initial  assignment  of  documents  to 
clusters  is  arbitrary. 

Results reveal that the only condition under which significant differences
in the classification of documents do not occur is when the order of
document presentation is varied. Final clusters are significantly affected
by the initial assignment of documents to clusters. The number of
documents added to a data base allows stability of clusters only up to a
cutoff point, which is some percentage of the original number of
documents in the data base.

SECTION I
INTRODUCTION 

Just  as  a  library  divides  its  collection  of  books  into  subject  categories, 
so must an automated system organize its files into sections to achieve
efficient storage and retrieval. However, if the classification of
documents  in  an  automated  system  is  done  manually,  some  of  the  advantages 
of  the  high-speed  computer  are  lost,  due  to  the  delays  in  preparing  the 
input.  This  problem  has  been  recognized  by  researchers,  and  a  number  of 
attempts  have  been  made  to  devise  a  classification  algorithm  that  would  be 
both reasonable and economically feasible.

In  traditional  classification  systems,  skilled  librarians  classify 
documents  into  categories  on  the  basis  of  subject  content.  In  an 
automated system, where the work of classification must be carried out
by  computers  and  not  by  people,  class  membership  is  determined  on  the 
basis  of  the  words  contained  in  the  document  or  in  a  list  of  index  terms 
ascribed  to  the  document.  This  is  a  radically  different  principle,  but 
it is a reasonable one. Ideas are expressed in words, and documents on
different topics will use different sets of words to express ideas.

It  follows,  therefore,  that  documents  can  be  ordered  into  classes  on  the 
basis of similarity or differences in vocabulary. It is further postulated
that classifying documents in accordance with the principle of similar word
usage would result in a classification system analogous to, but not identical
with, traditional subject categories and one that would be usable by both
men and machines.

A number of mathematical techniques for deriving classification systems
have been suggested. These include clump theory, factor analysis,
latent-class analysis, discrimination analysis, and others. References
to these techniques, along with a brief description, may be found in
Automated Language Processing (Borko, 1967). In general, all of the
above procedures require lengthy computation, and the amount of computer
time increases by some factor, either the square or the cube, of data
base size. As a result, these sophisticated taxonomic techniques are
impractical when applied to large data bases.

Lauren Doyle (1966), in a research project supported by the Rome Air
Development Center, devised a procedure for breaking this impasse, and he
described a method of automatic classification that uses computer time
in direct proportion (as a logarithmic function) to the number of items
in the base. The programs, called ALCAPP (Automatic List Classification
and Profile Production), are based upon the techniques of Joe Ward
(Ward and Hook, 1963). Doyle's work was a major methodological
contribution, for it removed a great obstacle from the path toward
practical automatic document classification.

The  current  project  was  a  continuation  of  the  study  of  automatic 
classification  techniques  and  had  as  its  major  tasks: 

(1) To investigate the statistical reliabilities of the ALCAPP
algorithms. 

(2) To evaluate the effectiveness of the machine-produced classification
hierarchy as an aid in predicting document content and as an
adjunctive retrieval tool in an on-line time-shared system.

(3) To recode the ALCAPP programs for operation on the GE 635 computer
which  is  available  for  use  at  the  Rome  Air  Development  Center. 

SECTION II

SELECTION  OF  THE  DATA  BASE 

Since the RADC contract, under whose sponsorship these studies were
conducted,  did  not  specify  the  subject  content  of  the  data  base,  it 
was  decided  to  use  documents  in  the  field  of  information  science. 

The  main  advantages  are  that  these  documents  are  readily  available  at 
System  Development  Corporation  and  that  SDC  employs  a  number  of  experts 
in  this  area.  If  necessary,  these  people  could  be  used  to  evaluate 
the reasonableness of the data processing results, e.g., indexing and
classification, and the effectiveness of the system. On the negative
side, information science does not have a well-specified thesaurus or
authority  list  of  terms  for  use  in  indexing. 

After  consultation  with  the  contract  monitors  at  RADC,  it  was  agreed 
that  the  advantages  of  using  a  data  base  of  information  science  materials 
outweighed  the  disadvantages.  With  their  concurrence,  the  following 
documents  were  selected: 

(1) The full text of the 52 papers that were printed in the
Proceedings  of  the  1966  American  Documentation  Institute  Annual 
Meeting  (Black,  1966). 

(2) The abstracts of 600 other documents in the field of information
science.

Simplified keypunching rules were specified by the contractor
and approved by the monitor (see the Appendix). The entire
data base (that is, both the 52 documents and the 600 abstracts) was
keypunched in accordance with these rules and was thus made available
for computer processing. The 52 full-text documents were used to
study the reliability and consistency of the automatic classification
procedure. A subset of these documents was used in the experiments
judging the incremental value of the classification hierarchy in
predicting document content and relevance. The abstracts were used
to study the stability of the classification categories as new documents
are added to the data base.

SECTION III

PREPARATION OF WORD LISTS

The  data  used  as  input  to  the  classification  programs  were  lists  of 
index terms derived from the documents, and not the documents themselves
or their natural language abstracts. By indexing each document both
manually and by machine-aided methods, the type and quality of the
indexing  was  varied.  The  length  of  the  word  lists  was  also  varied  by 
creating  lists  of  6,  15,  and  30  terms  each.  Thus,  it  was  possible  to 
determine  the  effect  that  the  type  and  depth  of  indexing  would  have  on 
the  reliability  and  consistency  of  the  resulting  classification  systems. 

1. PREPARING WORD LISTS FROM THE 52 FULL-TEXT DOCUMENTS
The  52  full-text  documents  were  to  be  used  to  investigate  the  effect  of 
indexing  type  and  indexing  depth  on  the  reliability  and  consistency  of 
the  ALCAPP  classification  procedures.  To  do  so,  each  document  was 
indexed  by  six  different  procedures  as  follows: 

Human Indexing            30 terms
Human Indexing            15 terms
Human Indexing             6 terms
Machine-Aided Indexing    30 terms
Machine-Aided Indexing    15 terms
Machine-Aided Indexing     6 terms

a. Human Indexing

The  "human  indexing"  was  done  by  trained  librarians  from  the  SDC  library 
staff.  They  were  given  copies  of  the  52  documents  and  asked  to  assign 
30 appropriate subject headings. They were asked to use a free vocabulary,
since  no  authority  list  was  available.  They  were  also  instructed  to  arrange 
the  terms  in  a  rough  order  of  importance,  so  that,  for  each  document,  the 
first  6  terms,  the  first  15  terms,  and  the  complete  list  of  30  terms  could 
be  used  separately  for  different  phases  of  the  experiment.  In  some  instances, 
the  indexers  found  it  impossible  to  list  30  terms,  and  shorter  lists  were 
accepted. 

Since  there  were  no  controls  over  the  vocabulary,  some  editing  was  necessary 
in  order  to  achieve  a  degree  of  consistency  and  compatibility.  The  index 
terms  were  keypunched,  and  sorted  alphabetically,  by  frequency,  and  by 
individual  document.  These  lists  were  returned  to  the  indexers  for 
editing  and  modification.  Variations  in  the  use  of  plural  and  singular 
endings  were  changed,  e.g.,  COMPUTER  was  changed  to  COMPUTERS;  certain 
modifiers  were  dropped,  e.g.,  MAGNETIC  TAPE  STORAGE  was  changed  to 
MAGNETIC TAPE; word order was standardized, e.g., ABSTRACTING, AUTOMATIC
became AUTOMATIC ABSTRACTING; near synonyms were combined, e.g., AUTHORITY
LISTS was merged into AUTHORITY FILES; and some names were abbreviated, e.g.,
COMMITTEE  ON  SCIENTIFIC  AND  TECHNICAL  INFORMATION  became  COSATI,  etc. 

The sole aim of the editing was to achieve consistency in the use of
terms for this experiment. It was not our purpose to create a generally
useful lexicon. No attempt was made to combine generally similar terms
into a single concept if the indexer believed them to be separate, so
that ALGEBRA, ABSTRACT and ALGEBRA, MODERN were retained as separate
terms. Similarly, if a single document was indexed by both COMMUNICATION
and COMMUNICATION OF TECHNICAL INFORMATION, both terms were retained.

For mechanical reasons and in order to reduce computer processing time,
each term was truncated at 15 alpha characters. In those instances where
truncation could cause ambiguity, the numeric digits 1, 2, 3, etc.,
were added to ensure uniqueness.
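
The truncation rule can be illustrated with a short sketch. The report does
not say whether blanks count toward the 15 characters, where the digit is
placed, or whether the first of several colliding terms also receives a
digit; the choices below (count every character, append the digit after the
15-character stem, number all colliding terms in order of first appearance)
are assumptions, and the function name truncate_terms is hypothetical.

```python
from collections import defaultdict


def truncate_terms(terms, width=15):
    """Map each index term to a truncated code, numbering any collisions.

    Assumption: every character (including blanks) counts toward `width`,
    and colliding terms receive the suffixes 1, 2, 3, ... in first-seen order.
    """
    groups = defaultdict(list)          # truncated stem -> distinct full terms
    for term in terms:
        full = term.upper()
        stem = full[:width]
        if full not in groups[stem]:
            groups[stem].append(full)

    codes = {}
    for stem, members in groups.items():
        if len(members) == 1:
            codes[members[0]] = stem    # no ambiguity, keep the bare stem
        else:
            for n, full in enumerate(members, start=1):
                codes[full] = f"{stem}{n}"
    return codes


codes = truncate_terms([
    "COMMUNICATION",
    "INFORMATION STORAGE",
    "INFORMATION STORAGE AND RETRIEVAL",
])
```

Here the two INFORMATION STORAGE terms share the same first 15 characters and
are therefore numbered, while COMMUNICATION is short enough to pass through
unchanged.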

b.  Machine-Aided  Indexing 

While it was not the purpose of this study to devise methods of automatically
indexing textual material, the project staff did process the documents in
the data base and prepared word lists as aids in the selection of index
terms. Each of the 52 complete documents was processed individually to
create an alphabetical list of all words used in the text, together with
their frequency of occurrence. This basic list was then reordered so that
the word with the highest frequency would be listed first and the others
would follow in descending order. Then, using an available routine that
would combine plural and singular forms of the same root, the alphabetically
ordered list was rerun and words with the same root combined. Next the
individual lists of all 52 documents were merged, creating a unified
frequency-ordered list in which singulars and plurals were combined. An
alphabetically ordered listing was also obtained.
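
A minimal sketch of this word-list preparation follows. The routine actually
used to combine singular and plural forms is not described in the report, so
the singular function below is a deliberately crude stand-in (it strips a
final "s" and will mangle words such as "analysis"); all function names are
hypothetical.

```python
import re
from collections import Counter


def singular(word):
    """Crude stand-in for the plural-combining routine: drop a final 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def word_frequencies(text):
    """Count every word in one document, with plurals folded into singulars."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(singular(w) for w in words)


def frequency_ordered(counts):
    """Highest frequency first; ties are listed alphabetically."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def merge_lists(per_document_counts):
    """Merge the individual document lists into one unified list."""
    total = Counter()
    for counts in per_document_counts:
        total.update(counts)
    return total


doc_counts = word_frequencies("Standards and standard codes. Codes identify a standard.")
ranked = frequency_ordered(doc_counts)
merged = merge_lists([doc_counts, word_frequencies("A code for each standard")])
```

The alphabetical listing mentioned in the text is simply sorted(counts.items()).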


MANUAL INDEXING

DOCUMENT NO. 1: Progress in Internal Indexing*

information storage and retrieval systems
indexing
automatic indexing
indexing, manual
abstracting and indexing services
subject indexing
computers--applications
computers--applications--writing and editing
content analysis (computers)
machine translation
cataloging of technical literature
computers--machine-readable text
computers--research
information science--research
cataloging
documentation
data processing systems--libraries
computers--applications--libraries
report writing
research--indexes
congresses and conventions--abstracting and indexes
books--abstracting and indexes
word files
sentence files
punched cards
sentence entities
recursive procedures
indexing term selection
term dictionaries
internal indexing

Figure 1. Manual Indexing of Document No. 1
before Truncation

*Maloney, C. J. and M. H. Epstein. Progress in Internal Indexing
(Black, 1966)

MACHINE INDEXING: ABSTRACT

Title: Identifying and Locating Standards

 1. standard            16. quality
 2. subject             17. symbol
 3. number              18. L3CA (abstractor's code)
 4. type                19. identification
 5. report              20. deal
 6. association         21. produced
 7. national            22. difficulty
 8. organization        23. microfiche
 9. area                24. image
10. international       25. cover
11. requirement         26. practice
12. individual          27. initial
13. code                28. firm
14. identify            29. encountered
15. librarian           30. specification

Figure 3. Sample of Machine-Aided Indexing
of an Abstract

The task of the editor was to select 30 words for each of the 52 documents
for input to the Hierarchical Grouping Program. Each list had to be so
arranged that the first 6 terms and the first 15 terms could themselves
constitute lists for processing. Using these various computer-prepared
printouts, the editor was able to make the selection reasonably efficiently,
and to prepare the lists for subsequent computer processing.

2. PREPARING WORD LISTS FROM THE 600 DOCUMENT ABSTRACTS
The 600 document abstracts were used in the experiments designed to test
the stability of the groupings which result from the application of the
Cluster Finding Program. Word lists had to be prepared from each of the
abstracts for input to this program. These lists were prepared by
computer analysis on the basis of frequency of word occurrences and then
minimally edited by the investigators. The abstracts were not indexed
manually, since our objective was not to determine whether there would
be differences in the clustering due to differences between manual and
machine indexing. The extent of such differences would be determined
more precisely in the hierarchical structuring experiments, using full
documents. With the 600 abstracts, our only purpose was to measure
the stability of the resulting clusters, and for this purpose we used
lists of 30 terms prepared by computer analysis of the abstracts.

SECTION IV

MEASURING THE RELIABILITY AND CONSISTENCY OF THE ALCAPP SYSTEM

A classification system is considered to be reliable if documents
classified into a given category will be classified into that same category
on subsequent trials. If the system is not reliable and the document
classifications vary, classification will not be a useful adjunct to
retrieval. Reliability and consistency are necessary, but not sufficient,
conditions for a useful classification system. Therefore, the first
series of experiments was designed to investigate the reliability and
consistency of the ALCAPP system and the variables that affect the
reliability of the automatic classifications.

1. DESCRIPTION OF THE HIERARCHICAL GROUPING PROGRAM
It is obvious that different classification procedures will result in
different classification structures. The Library of Congress Classification
schedule differs from the Dewey Decimal, and clearly a machine-derived
system will differ from both of these. The task of this project was to
determine the statistical properties of the ALCAPP method of machine
classification, and not to compare it with manual methods.

Machine classification is based on the assumption that documents
containing more words in common are more similar to each other in content
than are documents which have fewer words in common. Each document is
represented by a surrogate or list of index terms. As was pointed out
in the previous section, these index terms could be derived either manually
or by machine. The ALCAPP programs begin by comparing these lists and
counting the number of identical terms in each pair. They construct
a matrix (or rather a half matrix, since the data are symmetric around
the diagonal) in which each cell contains the number of terms the
two word lists share. In this study, 52 document index
lists were compared, so the actual number of comparisons is

(52 x 51) / 2 = 1,326.
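
The pairwise comparison can be sketched as follows. This is only an
illustration of the counting step described above, not the ALCAPP code; the
function name shared_term_counts and the dictionary representation of the
half matrix are assumptions.

```python
from itertools import combinations


def shared_term_counts(index_lists):
    """Return the half matrix as {(i, j): shared-term count} for all i < j."""
    term_sets = [set(terms) for terms in index_lists]
    return {
        (i, j): len(term_sets[i] & term_sets[j])
        for i, j in combinations(range(len(term_sets)), 2)
    }


# Three toy index lists; documents 0 and 1 share two terms.
toy = shared_term_counts([
    ["indexing", "computers", "cataloging"],
    ["indexing", "computers", "abstracting"],
    ["microfiche", "standards"],
])

# With 52 lists the half matrix has 52 * 51 / 2 cells.
cells_for_52 = len(shared_term_counts([[str(n)] for n in range(52)]))
```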


The program searches the matrix and finds the largest cell value,
i.e., that pair of documents with the most terms in common. In
case of a tie, the first value is chosen. These two documents (let's call
them Di and Dj) are now chosen to be the first pair in the hierarchical
classification structure. This completes the first iteration of the
program.


In the second iteration, documents i and j have been eliminated and
combined into one value; call it G1. A new matrix is created of order
N-1, or 51. Documents i and j are excluded, but in their place is a
new vector G1. The program now calculates the similarity of the
remaining documents to G1 and places this value in the appropriate
cell. It then searches the matrix for the highest value. This value
can represent the similarity of two documents, as was the case in the
first iteration, or it can represent the similarity of a document with
G1. If the former is true, the two documents will be combined to form
a new group of two, while in the latter instance, a third document will
be added to the first group. In either case, the new group is called
G2, and the program has completed the second iteration.

To  complete  the  entire  hierarchical  grouping  structure,  one  less 
iteration than there are documents in the set will be required. The
last  iteration  will  form  a  single  group  containing  the  entire 
collection. 
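
The iterations described above can be sketched in a few lines. The function
used to score the similarity between a group and the remaining documents is
not given in this section, so the sketch assumes a simple one: the mean
number of shared terms over all cross-group document pairs. This is not
claimed to be either of the ALCAPP averaging functions, and the name
hierarchical_grouping is hypothetical.

```python
from itertools import combinations


def hierarchical_grouping(index_lists):
    """Merge documents pairwise until one group remains.

    Returns the merges in iteration order as (left_group, right_group),
    where each group is a tuple of document numbers.
    """
    term_sets = [frozenset(terms) for terms in index_lists]
    groups = [(i,) for i in range(len(term_sets))]
    merges = []

    def similarity(a, b):
        # Assumed averaging function: mean shared-term count across groups.
        shared = sum(len(term_sets[i] & term_sets[j]) for i in a for j in b)
        return shared / (len(a) * len(b))

    while len(groups) > 1:
        best = None
        for x, y in combinations(range(len(groups)), 2):
            score = similarity(groups[x], groups[y])
            if best is None or score > best[0]:   # strict '>' keeps the first tie
                best = (score, x, y)
        _, x, y = best
        merges.append((groups[x], groups[y]))
        groups[x] = groups[x] + groups[y]         # new group takes the earlier slot
        del groups[y]
    return merges


merges = hierarchical_grouping([
    ["a", "b", "c"],
    ["a", "b", "d"],
    ["x", "y"],
    ["x", "z"],
])
```

For N documents the loop runs N-1 times, and the final merge produces a
single group containing the whole collection.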

A  mathematical  description  of  the  classification  process  can  be  found 
in the documents by Ward (1959), Ward and Hook (1963), and Baker (1965).
By  using  the  same  basic  technique  but  varying  the  function  used  to 
calculate  similarity,  different  hierarchical  structures  can  be  formed. 
(Figures  4  and  6  are  examples.)  The  program  can  also  label  the  nodes 
of  the  structure,  thus  providing  an  indication  of  the  common  elements 
that  link  the  documents  together  (see  Figures  5  and  7  as  examples). 

2. MEASURING CLASSIFICATION SIMILARITY

Classification  similarity  is  measured  by  means  of  a  distance  matrix 
that  provides  a  measure  of  the  distance  separating  each  document  from 
every other document in the classification system. This procedure is
used  to  provide  a  rigorous  measure  of  the  reliability  and  consistency 
of  automatic  classification  under  different  laboratory  conditions. 

For  these  purposes  a  small,  intensively  analyzed  sample  of  52  documents 

In these experiments, the distance matrix was a symmetrical matrix of
order 52, for we were using 52 documents. The number in each cell
(the intersection of each row and column) represents the number of
the level at which the two documents are joined. A distance matrix
was computed for the 12 documents distributed on the hierarchical
clustering scheme, as illustrated in Figure 4. The half-matrix of
distances is shown in Figure 8.
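
As an illustration, a distance matrix of this kind can be built from a record
of the merges. The sketch assumes that the "level" of a join is the number of
the iteration at which the two documents first fall into the same group, and
that the hierarchy is supplied as a list of (left_group, right_group) tuples
of document numbers; both the input format and the name distance_matrix are
assumptions.

```python
def distance_matrix(merges, n):
    """Cell (i, j) holds the level at which documents i and j are joined."""
    dist = [[0] * n for _ in range(n)]
    for level, (left, right) in enumerate(merges, start=1):
        # Every document on the left meets every document on the right here.
        for i in left:
            for j in right:
                dist[i][j] = dist[j][i] = level
    return dist


# Four documents: 0 and 1 join first, then 2 and 3, then the two pairs.
example_merges = [((0,), (1,)), ((2,), (3,)), ((0, 1), (2, 3))]
dist = distance_matrix(example_merges, 4)
```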

Once  the  distance  matrices  have  been  computed,  it  now  becomes  possible 
to  determine  the  degree  of  similarity  between  any  or  every  two  matrices 
by  correlating  the  respective  columns.  Thus,  it  also  becomes  possible 
to  measure  the  importance  that  such  variables  as  the  depth  or  type  of 
indexing  would  have  on  the  similarity  of  the  resulting  classifications. 
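
One way to carry out such a comparison is sketched below. The report says
only that the respective columns were correlated; the sketch assumes a
Pearson product-moment correlation taken over the cells of the two half
matrices, and the function names are hypothetical.

```python
from math import sqrt


def half_matrix(dist):
    """Flatten the cells above the diagonal into one list."""
    n = len(dist)
    return [dist[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) if either list is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def classification_similarity(dist_a, dist_b):
    """Correlate two distance matrices built over the same documents."""
    return pearson(half_matrix(dist_a), half_matrix(dist_b))


a = [[0, 1, 3], [1, 0, 3], [3, 3, 0]]   # documents 0 and 1 join first
b = [[0, 3, 3], [3, 0, 1], [3, 1, 0]]   # documents 1 and 2 join first
same = classification_similarity(a, a)
different = classification_similarity(a, b)
```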

3. VARIABLES RELATED TO CLASSIFICATION SIMILARITY

What  makes  one  classification  scheme  similar  to  another?  What  variables 
affect  the  degree  of  similarity  between  two  classification  schemes? 

These  studies  were  designed  to  shed  some  light,  in  the  form  of  statistical 
data,  on  the  intuitive  answers  that  are  usually  given  to  the  above 
questions. 

Based  upon  a  logical  analysis,  the  following  five  variables  are  believed 
to  be  related  to  classification  similarity: 

(1)  The  homogeneity  of  the  document  collections. 

(2)  The  classification  procedure,  or  algorithm,  being  used. 




(3) The type or quality of indexing, whether it be term or concept
indexing. 

(4)  The  depth  of  indexing  used. 

(5)  The  order  in  which  the  documents  are  processed. 

These variables were compared systematically in order to determine their
effect  on  the  resulting  classification  structures. 

a. The Homogeneity of the Document Collection
One of the variables that could affect the reliability of the classification
procedure is the homogeneity or diversity of the document
collection. To test the effect of this variable, we would need four or
five different document collections, and these collections would have to
span a range from a narrow hard science collection, such as solid state
physics, through perhaps the broader field of geology on to the still
broader field of social sciences. This project is basically a pilot
study, and because of time and cost constraints, we decided not to
manipulate the data base as a variable in this experimental design.
Instead, we held the homogeneity factor constant by limiting the
analysis to the field of information science documentation. The selected
collection of documents probably constitutes a mid-range position on the
scales of diversity and hardness of data, for it covers a single,
relatively homogeneous but broad subject area in the social sciences.

b. The Classification Algorithm

In  the  course  of  SDC's  research  program  on  automated  classification, 
a  number  of  different  algorithms  have  been  developed.  While  all  of 
them  use  basically  the  same  technique  described  in  Paragraph  1,  they 
do differ in the averaging function used, that is, the mathematical formulas
for  computing  a  value  of  group  similarity. 

Two  such  algorithms  seemed  particularly  worth  investigating  and 
comparing. These were arbitrarily called WD-2 and WD-3 in a sequence
of modifications. The WD-2 algorithm maximizes the within-group
similarity function and puts a premium on preserving the homogeneity
of groups that have already been formed. The WD-3 algorithm takes
an opposite approach and combines lists that have a minimum dissimilarity
as contrasted with a maximum similarity.

While  it  might  appear  on  the  surface  that  these  functions  should 
perform similarly in forming groups, this need not be the case, for
the  program  examines  different  data  (see  Figures  4  and  5).  By 
including  both  programs  in  the  experimental  design,  it  was  possible 
to  compare  the  form  and  reliabilities  of  the  classification  structures. 

c.  The  Type  (or  Quality)  of  Indexing 

The documents to be classified were indexed by qualified librarians
who were instructed to use multi-word concept or subject indexing.

In addition, these same documents were indexed by key words using
machine-aided selection techniques. Obviously, the lists differ.
The question being investigated is: Do classification systems
based on human indexing differ significantly from classification
systems based upon machine-aided indexing?

d.  The  Depth  of  Indexing 

Since  the  inputs  to  the  classification  program  are  lists  of  words, 
it was important to investigate the effect that different-sized lists
would have on the reliability of the classification structure. In
order to test the effect of this variable, different length lists,
containing  6,  15,  and  30  terms  each,  were  used  and  varied  systematically. 

e. The Order of Document Presentation

In  the  description  of  the  hierarchical  grouping  program  (paragraph  1),  it 
was  explained  that,  although  the  program  combined  documents  into  groups 
by  searching  the  similarity  matrix  for  the  highest  cell  value,  when 
more  than  one  cell  had  the  same  value,  the  first  position  was  used  to 
form  the  group.  As  a  result  of  the  procedure  used,  the  order  in  which 
the  documents  are  processed  could  affect  the  final  hierarchical 
classification  structure. 

A series of experiments was designed to determine whether the order
of document presentation would cause significant differences. The 52
documents were arranged in three different orders for input to the
computer program. The documents were numbered from 1 to 52. The
first order arranged the documents in ascending numerical value. The
second order was the reverse, with document number 52 being
processed first. And the third order was a random arrangement of
the documents. For each of these three arrangements, hierarchical
groupings of the 52 documents were computed and their structures
compared for similarity.

4. THE EXPERIMENTAL DESIGN

The aim of this set of experiments was to investigate the reliability
and consistency of automatically derived classification hierarchies,
as selected attributes are varied in a controlled fashion. The four
selected attributes are:

(1)  The  classification  algorithm: 

WD-2 

WD-3 

(2)  The  type  of  document  indexing: 

M = machine-aided
H = human

(3)  The  depth  of  indexing: 

6  terms 
15  terms 
30  terms 

(4) The order of document input for processing:

01 = ascending order 1-52
02 = descending order 52-1
03 = random order

In order to vary the attributes systematically, under all possible
conditions, 36 hierarchical classification structures were required
(2 x 2 x 3 x 3). Figure 9 lists all 36 classification matrices and the
particular attributes that were used in their construction.
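
The full factorial design is easy to enumerate. The sketch below only
reproduces the count of 36 conditions; the order in which it lists them is
arbitrary and is not meant to match the numbering of the structures in
Figure 9, and the variable names are illustrative.

```python
from itertools import product

ALGORITHMS = ["WD-2", "WD-3"]
INDEXING_TYPES = ["M", "H"]                      # machine-aided, human
DEPTHS = [6, 15, 30]                             # terms per list
ORDERS = ["ascending", "descending", "random"]   # order of document input

# One tuple per classification structure that must be computed.
conditions = list(product(ALGORITHMS, INDEXING_TYPES, DEPTHS, ORDERS))

# Conditions that differ only in the order of input, for one fixed setting.
order_only = [c for c in conditions if c[:3] == ("WD-2", "H", 30)]
```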

Once the classification structures were derived by machine processing,
the information contained therein was transformed into sets of distance
matrices, which were, themselves, correlated. The outcome of the
correlation program was a 36 x 36 matrix, in which the rows and columns
are the 36 different classification structures, and a cell value is
the correlation coefficient indicating the similarity between the pair
of classification schemes. The complete correlation matrix is
reproduced as Figure 10.

The  following  criteria  were  used  in  interpreting  the  correlation 
matrix: 


(1) High Similarity        r = .70 to .99
(2) Moderate Similarity        .40 to .69
(3) Slight Similarity          .20 to .39
(4) No Similarity              .00 to .19

These  numbers  and  ranges  are  useful  in  making  comparative  judgments 
and  not  for  absolute  scalar  judgments. 
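
For convenience, the criteria can be expressed as a small lookup. The report
defines the ranges only for coefficients between .00 and .99; treating any
value below .20 (including negative coefficients) as "no similarity" is an
assumption of this sketch, as is the function name similarity_label.

```python
def similarity_label(r):
    """Map a correlation coefficient to the interpretive categories above."""
    if r >= 0.70:
        return "high"
    if r >= 0.40:
        return "moderate"
    if r >= 0.20:
        return "slight"
    return "none"   # assumption: also covers negative coefficients


labels = [similarity_label(r) for r in (0.85, 0.63, 0.28, 0.05)]
```

As the text cautions, these labels are aids to comparative judgment and not
points on an absolute scale.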

In addition to the entire 36 x 36 matrix, which contains 1,296
values,  sections  of  the  matrix  are  presented  below  in  tabular  form  as 
these  data  relate  to  the  attributes  being  investigated. 

a. The Effect of the Classification Algorithm (WD-2 or WD-3)
on  the  Reliability  of  the  Classification  Structure 

Since, as was discussed in Paragraph 3b, the algorithms used in the
WD-2 and WD-3 programs are different, it was desirable to investigate
the degree of similarity between the classification structures that
result from their use. Would those different machine procedures yield
very different or very similar classification structures? The results,
recorded in the last column of Figure 11, list the values of the
correlation coefficient as varying from .28 to .63. These figures
indicate that there is a slight to moderate degree of similarity
between the structures derived by the two classification procedures.
These results are in accord with our intuitive expectation, and they
reinforce our notion that, even though the inputs are the same,
different machine classification algorithms will result in different
document groupings, the degree of similarity being dependent upon
the similarity of the procedures used. This last statement should
perhaps be modified somewhat, for the data seem to suggest that
machine classification based upon machine-derived index terms is
slightly more reliable than is machine classification based upon
concept index terms. However, this is only a tentative formulation,
based upon the fact that the correlation coefficients are slightly
higher for machine indexing than for human indexing. These results
need to be verified before they can be given much credence.

b. The Effect of Indexing Type on the Reliability of the
Classification Structure

This section describes the study of the differences in classification
structure caused by human and machine-aided indexing as all other
variables are held constant. It will be recalled that human indexing
was done by trained librarians from the SDC library staff who were
asked to assign appropriate multiple-word concepts as index terms.
In contrast, the machine-aided indexing was of the single-word Uniterm
type.


The table in Figure 12 is arranged to show, in a clear and unmistakable
manner, the importance that indexing style (human or machine-aided)
has on the structure of the automatically produced classification
hierarchy. Column A is the same throughout the 18 rows; the letters
M and H simply indicate that in all cases we will be comparing machine
and human indexing. The first nine rows of column B indicate that we
will initially examine the data generated by classification algorithm
WD-2 and then look at WD-3. Column C indicates the depth of indexing
and column D the order of input. In column E are listed the pairs of
classification structures that meet all preceding conditions (see
Figure 9); and in the last column, the values of the appropriate
correlation coefficients are recorded.

These data are of great significance. They clearly show that, given
the same set of documents, machine-aided indexing based upon key words
will result in an entirely different distribution of the parent
documents in the machine-produced classification structure than would
be obtained if the input lists were multiple-word concept terms
prepared by skilled humans. Note that we have not said that one
structure is better than the other (we discuss utility in Section VI
of this report) but only that the structures are significantly different
to a degree that most of us would not have anticipated.

All the values in column F are essentially zero, with the exception
of the values in rows 10, 11, and 12, which show slight positive
correlations. These occur under classification procedure WD-3 when
the depth of indexing is six terms. Under these conditions, there are
the greatest similarities (although still slight) between the
classification structures derived from human and machine-aided
indexing. This finding is consistent with the findings that the effect
of depth of indexing is less marked when human index lists of six
terms are processed by the WD-3 program.

c. The Effect of Indexing Depth on the Reliability of the
Classification Structure

The next question we wish to investigate is whether the number of
indexing terms on the list being processed affects the classification
structure even though all other variables are counterbalanced.
Another way of phrasing this same question is to ask whether the
classification structures derived from 6, 15, and 30 terms would
be significantly different from each other.

First let us examine the effect of using indexing lists of 6, 15,
and 30 terms that were machine derived, input into the WD-2 program
in order 1. Since there are three conditions, taken two at a time,
there are three interrelationships: rows 1, 2, and 3 of Figure 13.

These first three rows are interpreted to mean that the hierarchical
classification structures derived by using different length lists of
terms have an essentially zero correlation, and are therefore not at
all similar to each other.

But before coming to any overall conclusion, let us examine the next
six rows in Figure 13. If the hierarchical arrangement of the
documents in a classification structure is primarily dependent upon
the length of the index list, then we would expect to find similar
results over the three orders of input.

The expected results are borne out by an examination of rows 4, 5,
and 6, and rows 7, 8, and 9. Simply varying the size of the list
while using machine-aided indexing and the WD-2 program will cause
significant changes in the classification structure.

Figure 13. Correlation of Classification Structures
when the Number of Machine-Aided Index
Terms Is Varied

There is another bit of data that is worth noting: The last line
in all three sections (that is, rows 3, 6, and 9) has the highest
numerical value, which shows the greatest similarity between the
classifications based on 15 and 30 terms. This was a very tentative
formulation, for certainly the numerical values are not that far
apart, but it was worth checking.

We continued to investigate the significance of index lengths to
see whether these same relationships held when we used the WD-3
classification algorithm. These data are in the bottom half of
Figure 13.

The correlation coefficients on lines 10, 11, and 12; 13, 14, and 15;
and 16, 17, and 18 are indeed quite similar to the first three
subsections of this table, and we concluded that the length of the
index list can significantly affect the arrangement of items in an
automatically derived classification hierarchy, and that this
relationship holds, whatever the order of list processing, or whether
the classification algorithm is WD-2 or WD-3.

We have yet to see whether this same phenomenon would hold if the
index term lists were derived by skilled librarians rather than
by machine-aided techniques.

Let us examine the data in Figure 14. Note that lines 1, 2, and 3
are exactly comparable to the first three lines in Figure 13, except
that Figure 14 is based upon human indexing, while Figure 13 contains
machine-aided index lists. Since the values in the first three rows
of Figure 14 are slightly lower, it would appear that classification
structures based upon human indexing are more subject to variation as
the number of index terms per list is increased than are classification
structures based on machine-aided index lists.

We checked to see whether this trend continued as we examined additional
data, varying only the order of input. The sets of correlation
coefficients in the first three sections of Figure 14 are almost
identical in their values. This is not surprising, for the only
attribute varied was the order in which the lists were presented to
the program for processing.

At any rate, an examination of these three sets of data lends support
to our notion that machine-derived classification structures based
upon human-produced index lists are sensitive to the number of terms on
the lists; or, stated differently, that different-sized lists will
produce dissimilar hierarchical classifications.

There is one other bit of data that merits our attention. In all
three of these sections, the highest coefficient was obtained when
the classification structures based upon 6-term and 15-term index lists
were correlated (lines 1, 4, and 7). This contrasts markedly with the
corresponding data in Figure 13, where the highest value was between
15- and 30-term lists.

Figure 14. Correlation of Classification Structures
when the Number of Human-Selected Index
Terms Is Varied

Turning our attention to the bottom half of Figure 14, we checked to
see whether the WD-3 algorithm yielded data that were similar to those
obtained by the WD-2 procedures. The lower three sections reveal
that they are quite similar to each other and that they contain higher
values than those in the upper portion of the figure. Also, the
highest values, in the .60s, occur between list lengths of 6 and 15
terms.

The  analysis  of  these  tables  provided  a  basis  for  our  answering  the 
question:  Do  automatically  derived  hierarchical  classification 
structures  based  upon  index  lists  containing  6,  15,  and  30  terms 
differ  significantly? 

The answer was, clearly, "yes"; the size of the index term list affects
the classification structure, although the effect is less marked when
human indexing to a depth of six terms is processed by the WD-3 programs.

d.  The  Effect  of  the  Order  in  Which  the  Documents  Are  Input 
for  Processing 

The final variable that we wished to investigate was the effect of the
input order on the reliability of the classification structure. In
the description of the hierarchical grouping program (Paragraph 1),
we explained that the program combines documents into groups by
searching the similarity matrix for the highest cell value. However,
in case more than one cell has the same value, the first cell position
is used to form the group. It is thus possible that the order of
input may have an effect on the final clustering structure. To test
and evaluate the significance of the input order, three different
arrangements were used (see Paragraph 3c). The results are shown in
Figures 15 and 16.
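
The tie-breaking rule described above can be illustrated with a small
sketch. It is not the ALCAPP grouping program; the function names, the
row-by-row scan, and the toy similarity values are assumptions made
only to show how "first cell position wins" lets input order change
which pair of documents is grouped first.

```python
def first_max_pair(sim):
    """Return (i, j) of the first highest off-diagonal cell, scanning row by row."""
    best_value, best_pair = None, None
    n = len(sim)
    for i in range(n):
        for j in range(i + 1, n):
            # Strict '>' means a later cell with an equal value never
            # replaces the first one found (assumed reading of the rule).
            if best_value is None or sim[i][j] > best_value:
                best_value, best_pair = sim[i][j], (i, j)
    return best_pair


def reorder(sim, labels, order):
    """Rebuild the similarity matrix for a different input order."""
    idx = [labels.index(name) for name in order]
    return [[sim[a][b] for b in idx] for a in idx]


# Hypothetical similarities: pairs A-B and C-D tie for the highest value.
labels = ["A", "B", "C", "D"]
sim = [
    [0, 5, 1, 2],
    [5, 0, 3, 1],
    [1, 3, 0, 5],
    [2, 1, 5, 0],
]

i, j = first_max_pair(sim)
first_group_order1 = (labels[i], labels[j])

order2 = ["C", "D", "A", "B"]
i, j = first_max_pair(reorder(sim, labels, order2))
first_group_order2 = (order2[i], order2[j])

print(first_group_order1, first_group_order2)
```

With the same similarity values, the first order groups A with B and
the second groups C with D, which is the kind of order dependence the
experiment was designed to measure.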

The  order  of  input  did  cause  some  variation  in  the  final  classification 
structures,  but  by  itself,  this  was  not  a  very  significant  factor.  It 
was  also  clear  that  this  variation  was  less  for  the  WD-3  than  for  the 
WD-2  algorithm.  Furthermore,  classification  structures  based  upon 
documents  that  had  been  indexed  by  trained  indexers  using  multiple-word 
concept  terms  were  less  subject  to  variation  than  were  classification 
schemes using machine-derived word lists.

One additional point worth noting is that the more terms used to
describe the document, the less likely was the classification structure
to vary, because differences in document input had no effect whatsoever.

5.  A  FACTOR  ANALYSIS  OF  THE  CORRELATION  MATRIX 

In  collecting  data  on  the  reliability  and  consistency  of  automatically 
derived  classification  structures,  we  computed  a  table  of  intercorrelation 
for  the  various  classification  structures  (Figure  10).  While  individual 
correlation  coefficients  were  interpreted,  and  the  results  discussed  in 
the preceding paragraphs (Paragraphs 4a through 4d), we also analyzed
the matrix itself.

Figure 16. Correlation of Classification Structures
when the Order of Input for Human-Selected
Index Term Lists Is Varied

The 36 x 36 correlation matrix was factor analyzed using a principal
component solution (Harmon, 1967). Ten principal axes were extracted,
six of which accounted for 68.6 percent of the total variance. These
were rotated orthogonally for simple structure, and the results recorded
in Figure 17. Figures 18 through 22 were derived from the rotated factor
matrix, but, for ease of interpretation, we show only the significant
loadings, arranged in descending order. To the right of the values
are listed the attributes of the classification structure or array.

The interpretation of these factors is clear: The attributes that
have a significant effect on the similarity of the machine-derived
classification structure are primarily the type of indexing (machine
derived or human indexing) and the number of index terms used. This
analysis is supported by the interpretation of the factors, which have
been labeled as follows:

Factor I. Machine Indexing, Long Lists (Figure 18)

Factor IV. Machine Indexing, Short Lists (Figure 19)

Factor III. Human Indexing, Long Lists (Figure 20)

Factor V. Human Indexing, Short Lists, WD-2 (Figure 21)

Factor II. Human Indexing, Short Lists, WD-3 (Figure 22)

Factor VI. Machine Indexing, Long Lists, WD-3 (Figure 23)

Note that instead of having a single factor dealing with human
indexing, short lists, we have two, one for each of the two
classification algorithms. Note also that the last factor (Machine
Indexing, Long Lists, WD-3) is partially redundant with the first
factor and begins to divide that first factor in accordance with
the classification algorithm used. These findings are consistent
with the statement made earlier that the variables that contribute
most to the structure of the machine-derived classification system
are type of indexing and the number of index terms.

Figure 17. Rotated Factor Matrix

6.  SUMMARY  OF  RESULTS  AND  CONCLUSIONS 

The  experiments  described  in  Paragraph  4  were  designed  to  investigate 
the  reliability  and  consistency  of  the  ALCAPP  programs  for  automatically 
deriving  classification  hierarchies  as  four  selected  attributes  were 
varied  under  controlled  conditions.  These  attributes  were: 

(1)  The  classification  algorithm. 

(2)  The  type  of  document  indexing. 

(3)  The  depth  of  indexing. 

(4) The order of document input for processing.
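
The size of the experimental design follows directly from these four
attributes: two algorithms, two types of indexing, three depths (6,
15, and 30 terms), and three input orders. The short sketch below
simply enumerates them; the labels used for the input orders are
illustrative only.

```python
from itertools import combinations, product

# Levels of the four attributes as named in the text.
algorithms = ["WD-2", "WD-3"]
indexing_types = ["machine", "human"]
depths = [6, 15, 30]        # index terms per document
input_orders = [1, 2, 3]    # labels are illustrative

conditions = list(product(algorithms, indexing_types, depths, input_orders))
pairs = list(combinations(conditions, 2))

print(len(conditions))       # 36 classification structures
print(len(conditions) ** 2)  # 1296 cells in the full square matrix
print(len(pairs))            # 630 distinct pairwise comparisons
```

Only the 630 distinct pairs carry independent information; the full
matrix repeats each pair twice and includes the diagonal.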

A  total  of  36  different  combinations  were  studied,  and  36  classification 
structures  derived.  In  order  to  investigate  the  consistency  or 
similarity  of  these  structures,  each  classification  was  compared  with 
every other one and the results of these comparisons summarized in a
matrix of intercorrelations (Figure 10). This matrix provides the
data for analysis and interpretation.

a. The Importance of the Classification Algorithm

Two classification algorithms were used, and these are designated
WD-2 and WD-3. The WD-2 procedure results in a classification
structure that is relatively symmetric and consists of a few main
clusters (Figure 4). The WD-3 algorithm creates some main clusters
of similar documents plus small clusters of a few documents each, and
finally, some clusters of two, three, or even single documents. The
result is an asymmetric hierarchy (Figure 6).

Clearly, the classification structures are going to be somewhat
different, but the question being investigated was whether the
distributions of the documents within the two hierarchical
classification structures were similar; that is, would documents that
were put in the same cluster by one algorithm also tend to be close
together in the classification structure created by use of the other
algorithm?

The results of the comparisons showed that there was a slight to moderate
degree of similarity between the classification structures that were
derived from the WD-2 and WD-3 algorithms, and this is what we would
expect.

The data from the factor analysis support this conclusion, and amplify
it, by indicating that the classification algorithm has a greater effect
when the input lists consist of relatively few human-derived concept
terms. Under these circumstances, the WD-2 algorithm would tend to
force the document into a cluster, while the WD-3 algorithm would tend
to keep that document list separate and distinct, thus creating a
greater degree of dissimilarity than would be obtained under other
combinations of attributes.

At  any  rate,  the  mathematical  and  logical  techniques  used  for  making 
a  classification  structure  had  a  moderate  effect  on  the  similarity 
of  the  resulting  structures. 

b. The Importance of the Type of Document Indexing

Two  types  of  indexing  were  used  in  constructing  surrogate  lists  for 
input  to  the  classification  programs;  these  were: 

(1)  Concept  indexing  done  by  trained  librarians. 

(2) Key-word indexing using machine-aided techniques.

This  experiment  provided  clear  evidence  of  the  fact  that  these 
different  indexing  techniques  would  result  in  machine-produced 
classification  structures  that  had  little  resemblance  to  one  another. 
This  is  a  most  significant  finding,  for  it  states  that  regardless  of 
the  other  factors  involved,  document  subject  groupings  differ, 
depending upon whether the index terms used are uniterms or
pre-coordinated subject headings.

c.  The  Importance  of  Depth  of  Indexing 

Each document was indexed by 6, 15, and 30 terms. The question being
investigated was whether, other things being equal, classification
structures would differ significantly as a result of the number of
index terms used.

The results of the individual comparisons and of the factor analysis
clearly indicated that classification structures derived by using long
lists of index terms differ significantly from the structures
derived by using short lists.

Again,  this  is  a  rather  important  finding,  for  it  points  to  the  danger 
of  intermingling  depth  indexing  with  shallow  indexing  when  organizing 
document  collections. 

d.  The  Importance  of  the  Order  of  Document  Input 
The  particular  procedures  used  in  creating  clusters  of  documents  can  be 
affected  by  which  documents  are  used  for  creating  the  original  groupings. 
To determine the importance of this variable on the similarity of the
resultant classification structure, the input order was varied and the
effects studied.

The results showed that the effect of input order was not very significant,
and that the classification structures derived by using different orders
of input were quite similar.

From  a  practical  point  of  view,  this  finding  is  important  because,  if  the 
processing  order  can  be  ignored  (as  indeed  it  can),  the  classification 
algorithm  can  be  simplified. 


SECTION  V 


MEASURING  THE  STABILITY  OF  AUTOMATICALLY- 
DERIVED  DOCUMENT  GROUPINGS 

A manual classification system relies on human ingenuity to insure
flexibility as new and related items are added to the document collection
and to do this in such a way that the stability of the original structure
is maintained. Nevertheless, all classification systems tend to become
rigid  over  a  period  of  time,  and  when  significant  changes  are  made  in 
the  character  of  the  collection,  their  efficiency  is  reduced,  for  the 
systems  cannot  be  revised  radically  except  at  great  cost. 

One  of  the  unique  advantages  of  automated  document  classification  is  that 
the  entire  collection  of  materials  can  be  reclassified  periodically  and 
relatively  inexpensively.  However,  it  is  important  that  even  with 
reclassification  a  certain  stability  and  consistency  be  maintained. 
Documents that have been previously grouped together should not, in the
reorganized  system,  appear  to  be  unrelated. 

Reclassification by means of the ALCAPP cluster-finding programs has
been demonstrated to be relatively simple and inexpensive. Cost goes
up linearly with the number of documents being processed rather than
exponentially, as is the case with some procedures. However, the
stability of the classification needs to be evaluated. In order to do
so, two series of experiments were designed. The first set investigated
the sensitivity of the ALCAPP clustering algorithm to changes in initial
cluster assignments, and the second investigated the stability of the
classification system as the size of document collection is increased
incrementally.

1. DESCRIPTION OF THE CLUSTER-FINDING ALGORITHM

The  data  on  which  the  program  operates  are  a  set  of  word  lists  derived 
from  documents.  These  words  may  be  assigned  by  human  indexers  or  by 
computer.  In  these  experiments,  the  basic  document  collection  consists 
of  600  abstracts,  and  each  of  these  is  reduced  to  a  list  of  30  terms 
by machine methods, as described in Section III, Paragraph 2.

The ALCAPP cluster-finding algorithm is an iterative procedure, which
starts with an arbitrary assignment of documents to an arbitrary number
of clusters. Then with each iteration, documents judged to be similar
on the basis of the word lists are grouped together, and previously
unassigned documents are added to the clusters. A detailed description
of our procedure follows.

Input Stage: We began by choosing a reasonable number of clusters
into which the total collection can be divided. The actual number of
categories in the final classification scheme may be less than this
upper bound, depending on the differences in content among the documents.
In all of these experiments the initial number of categories was set at
10.

After the number of categories to be used for the initial iteration was
determined, a set of documents was assigned to each cluster, but no document
was assigned to more than one cluster. Twenty documents were so assigned.
Both the number of documents and their selection were arbitrary. We
wished to choose a reasonable-sized sample, but at the same time we
wished to maintain a large pool of unassigned documents, for these help
to differentiate and separate the categories. As will be seen, the program
shifts documents from their originally assigned categories and brings in
new documents from the unassigned pool.


First Iteration:

The individual word lists in each cluster were combined into a single
composite list or dictionary of all of the terms and their frequency of
occurrence. The dictionary list was then rearranged so that the most
frequently occurring words were listed on top and the other words followed
in a descending order of frequency.
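
As a minimal sketch of this step, the composite dictionary can be
built by counting terms across the word lists of one cluster and
sorting by descending frequency. The function name and sample terms
are hypothetical, and two details are assumptions not stated in the
text: each term is counted once per document, and ties in frequency
are listed alphabetically.

```python
from collections import Counter


def cluster_dictionary(word_lists):
    """Combine the word lists of one cluster into (term, frequency) pairs."""
    counts = Counter()
    for words in word_lists:
        counts.update(set(words))  # assumed: one count per document
    # Most frequent terms first; alphabetical order within a tie (assumed).
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical word lists for three documents in one cluster.
cluster = [
    ["retrieval", "index", "file"],
    ["index", "retrieval", "thesaurus"],
    ["index", "cluster"],
]

dictionary = cluster_dictionary(cluster)
for term, frequency in dictionary:
    print(frequency, term)
```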


Weights were assigned to each term on the list ...

Four other constraints were imposed on the program for assigning
weights:

(a) The highest value, or greatest possible weight, that can be assigned
to any term was 63. A cluster set that contains more than 63
document lists will nonetheless have 63 as the maximum weight.

(b)  All  terms  with  the  same  frequency  were  assigned  the  same  weight. 

(c) No term was assigned a negative or zero weight. Thus, if more
frequency  classes  exist  than  the  highest  weight,  all  lower 
frequencies  will  be  assigned  a  weight  of  1. 

(d)  Words  that  occurred  only  once  and  thus  had  a  word  frequency  of  one 
were  automatically  assigned  a  weight  of  1. 

An abbreviated example of weight assignment is shown in Figure 24.
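
The four constraints can be sketched in code, but the basic weighting
rule that precedes them is not available in this text, so the sketch
below has to assume one. The assumed rule is that the most frequent
class of terms receives a weight equal to the number of document lists
in the cluster, and each lower frequency class receives one less. Only
constraints (a) through (d) come from the text; the function name and
sample data are hypothetical.

```python
MAX_WEIGHT = 63  # constraint (a)


def assign_weights(term_freqs, n_docs):
    """Map each term to a weight, given its frequency within the cluster."""
    # ASSUMED base rule: top weight equals the number of document lists,
    # capped at 63 by constraint (a).
    top = min(n_docs, MAX_WEIGHT)
    freq_classes = sorted(set(term_freqs.values()), reverse=True)
    class_weight = {}
    for rank, freq in enumerate(freq_classes):
        # ASSUMED: one weight step per frequency class; never below 1 (c).
        weight = max(top - rank, 1)
        if freq == 1:
            weight = 1  # constraint (d): single-occurrence words get 1
        class_weight[freq] = weight
    # Constraint (b): terms with the same frequency share the same weight.
    return {term: class_weight[freq] for term, freq in term_freqs.items()}


small = assign_weights({"index": 3, "retrieval": 2, "file": 1, "cluster": 1}, 3)
large = assign_weights({"a": 90, "b": 40, "c": 1}, 100)
print(small)
print(large)
```

Under a different base rule the individual weights would change, but
the cap at 63, the floor at 1, and the shared weight within a
frequency class would apply in the same way.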

The  first  iteration  concluded  with  the  assignment  of  weights  to  each 
term in the dictionary list. The resulting list of weighted terms was
called  the  cluster  profile;  one  profile  was  constructed  for  each  cluster. 

Second  Iteration: 

At  the  start  of  the  second  iteration,  a  cluster  profile  existed  for  each 
cluster  or  category.  This  profile  consisted  of  a  list  of  all  the  terms 
appearing  in  the  document  word  lists  in  a  given  cluster,  plus  their 
assigned weights. Each cluster had its own profile.

Computer processing for the second iteration began by assigning a score
on each profile to every document in the entire collection. If there
were ten categories, each document was assigned ten profile scores.
A profile score is the arithmetic sum of the weights assigned to terms
in a profile that occur in the document's list of index terms. A ratio
between the highest profile score and the next highest score was also
computed. The document was then tentatively assigned to the cluster
profile on which it received the highest score. It is at this point
that initially created categories can disappear. This, indeed, happened
in the experiments described below. Of the 600 documents in the collection,
no document received its highest score in a particular category. As a
result, no documents were assigned to that category; so instead of ten
profile clusters there were only nine.
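
The scoring step just described can be sketched as follows. The
profile contents and names are hypothetical. Two behaviors are
assumptions, since the text does not cover them: a tie for the highest
score goes to the profile listed first, and a next-highest score of
zero yields an infinite ratio.

```python
def profile_score(profile, doc_terms):
    """Sum the weights of profile terms that occur in the document's list."""
    return sum(weight for term, weight in profile.items() if term in doc_terms)


def tentative_assignment(profiles, doc_terms):
    """Return (best cluster, ratio of best to next-best score, all scores)."""
    terms = set(doc_terms)
    scores = {name: profile_score(p, terms) for name, p in profiles.items()}
    # sorted() is stable, so ties keep the original profile order (assumed).
    ranked = sorted(scores.items(), key=lambda item: -item[1])
    (best, top), (_, runner_up) = ranked[0], ranked[1]
    ratio = top / runner_up if runner_up else float("inf")  # assumed handling
    return best, ratio, scores


# Hypothetical profiles (term -> weight) for two clusters.
profiles = {
    "C1": {"index": 3, "retrieval": 2, "file": 1},
    "C2": {"cluster": 2, "matrix": 2, "index": 1},
}

best, ratio, scores = tentative_assignment(
    profiles, ["index", "retrieval", "matrix"]
)
print(best, scores, round(ratio, 2))
```

A cluster that is never the best-scoring profile for any document
receives no tentative assignments, which is how a category can
disappear at this stage.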

A list was made of the documents assigned to each category. This list
was sorted on the profile score ratio that had been computed previously,
and the document identification numbers rearranged, so that the one with
the highest ratio value appeared on top and the rest were listed in
descending order. The top N + 1/2 N documents were assigned to each
cluster and all other documents were listed in the unassigned pool. N is
the number of documents assigned to a cluster in the previous iteration.
By limiting the number of new documents that could be assigned to a
cluster in any one iteration to 1/2 N, we could separate the clusters
into distinct subject categories and add new documents gradually.

It is perhaps obvious that although no more than N + 1/2 N documents
can be assigned to a category at each iteration, this does not mean
that that many documents will be assigned. Each document receives many
profile scores and is tentatively assigned to the category in which
it has the highest score. Clearly, more documents could be assigned
to one category than to another.
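
The growth limit can be sketched for a single cluster as follows. The
function name and sample ratios are hypothetical, and rounding 1/2 N
down for odd N is an assumption; the text does not say how a
fractional limit was treated.

```python
def admit(candidates, n_previous):
    """Split tentatively assigned documents into kept and returned-to-pool.

    candidates: list of (document id, profile score ratio) for one cluster.
    n_previous: N, the cluster's size in the previous iteration.
    """
    limit = n_previous + n_previous // 2  # N + 1/2 N (rounded down, assumed)
    ranked = sorted(candidates, key=lambda item: -item[1])
    kept = [doc for doc, _ in ranked[:limit]]
    pool = [doc for doc, _ in ranked[limit:]]
    return kept, pool


# Hypothetical candidates for a cluster that held 4 documents last time.
candidates = [
    ("d1", 1.2), ("d2", 3.0), ("d3", 1.0), ("d4", 2.5),
    ("d5", 1.1), ("d6", 1.8), ("d7", 1.05), ("d8", 2.0),
]
kept, pool = admit(candidates, 4)
print(kept)
print(pool)
```

With the 20 documents initially assigned to each cluster, the same
rule would allow at most 30 documents per cluster after the second
iteration.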

Subsequent Iterations:

The iterative process was repeated. New document profile scores were
computed for all the documents in the collection, together with their
appropriate ratios. Tentative assignments of documents were made to
the most likely category; these were re-sorted by ratio score, additional
documents added to the category, etc.

The  iterative  process  continued  until  (a)  every  document  had  been 
assigned  to  a  category,  and  (b)  the  new  set  of  clusters  was  exactly 
the same as the set obtained from the previous iteration. That is to
say, the clusters were stable, the iterative process converged, and
no document changed cluster assignment from one iteration to the next.

In the clustering experiments described below, the algorithm was
modified slightly to provide an additional constraint, in order to
prevent one cluster from being assigned all the documents in the set.

This modification was necessary because of the essential homogeneity
of the collection; that is, all the documents were on the subject of
information science. The algorithm was modified so that no cluster
could be assigned more than 90 documents, until there were no changes
in cluster assignment from one iteration to the next, but before all
documents in the collection had been assigned to a cluster.
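
The two-part stopping rule, every document assigned and no change from
the previous iteration, can be expressed as a small test. This sketch
covers only that rule and does not model the 90-document limit; the
function name and data layout (cluster name mapped to a set of
document ids) are assumptions.

```python
def has_converged(previous, current, all_docs):
    """True when every document is assigned and no assignment changed."""
    assigned = set().union(*current.values())
    everything_assigned = assigned == set(all_docs)
    nothing_moved = previous == current
    return everything_assigned and nothing_moved


docs = ["d1", "d2", "d3", "d4"]
step1 = {"C1": {"d1", "d2"}, "C2": {"d3"}}        # d4 still unassigned
step2 = {"C1": {"d1", "d2"}, "C2": {"d3", "d4"}}
step3 = {"C1": {"d1", "d2"}, "C2": {"d3", "d4"}}

print(has_converged(step1, step2, docs))  # assignments changed
print(has_converged(step2, step3, docs))  # stable and complete
```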

2.  DETERMINATION  OF  THE  SENSITIVITY  OF  THE  ALCAPP  CLUSTERING 
ALGORITHM  TO  CHANGES  IN  INITIAL  CLUSTER  ASSIGNMENTS 

In devising classification schemes, be they manual or automatic, one
finds that the character of the original set of documents plays a
disproportionately important role in determining the nature of the
subject categories into which the rest of the documents in the collection
will be divided. This is perhaps especially true when using the ALCAPP
algorithm, which begins with the arbitrary assignment of a number of
documents to each cluster. Yet, ideally, it would be desirable that
the final classification be the same regardless of which documents were
used in the original cluster assignments. These experiments were designed
to determine how far reality departs from this ideal.

a. Purpose

The purpose of this experiment was to determine the sensitivity of the
final classification structure to differences in the initial assignment
of documents to clusters.

b.  Method 

The experimental data set consisted of 600 documents and their surrogates,
600 word lists each containing 30 key-word terms derived by machine
analysis  of  the  document  abstract.  At  the  start  of  the  ALCAPP  processing 
procedures,  10  categories  were  created  and  20  documents  assigned  to  each 
category.  Then  the  program  divided  the  collection  into  clusters,  as 
described  in  the  previous  section.  The  documents  in  the  initial  cluster 
were  assigned  randomly,  the  only  constraint  being  that  a  document  could  be 
assigned  to  a  starting  cluster  only  once  in  the  entire  experiment. 

The  clustering  procedure  was  repeated  three  times,  creating  classification 
structures  A,  B,  and  C.  In  all  cases,  one  category  was  eliminated  by  the 
program;  thus,  all  three  structures  contained  nine  categories  each.  Since 
there  was  no  reason  to  expect  that  any  given  cluster  in  one  classification 
would  correspond  to  any  particular  cluster  in  a  second  classification,  each 
cluster  in  a  classification  was  compared  to  every  cluster  in  the  second 
classification.  The  comparison  consisted  simply  in  noting  the  number  of 
documents  that  the  clusters  from  differing  classifications  had  in  common. 
Hence, we arrived at a 9 x 9 matrix for the comparison of any classification
to any other. Since the number of documents in any cluster was known, it
was possible to compute the expected value of any cell in the matrix,
assuming only random similarity of document assignments. By comparing
the expected values to the observed values, both chi square and phi
could be computed for the entire matrix.
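
The comparison just described can be sketched in a few lines. This is an
illustration only, not the program used in the study: the function name
and the dictionary input format (document identifier mapped to cluster
label) are assumptions made for the example. Each cell's expected value is
taken as row total times column total divided by the number of documents,
which is the usual expectation under purely random agreement.

```python
from collections import Counter


def compare_classifications(first, second):
    """Return (chi_square, degrees_of_freedom) for two classifications.

    `first` and `second` are hypothetical inputs: dicts mapping a
    document id to the cluster label it received in each classification.
    """
    docs = sorted(set(first) & set(second))
    n = len(docs)
    # Observed cell counts: documents shared by a cluster pair.
    observed = Counter((first[d], second[d]) for d in docs)
    row_totals = Counter(first[d] for d in docs)
    col_totals = Counter(second[d] for d in docs)

    chi_square = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            # Expected count if cluster membership agreed only by chance.
            expected = r_total * c_total / n
            chi_square += (observed[(r, c)] - expected) ** 2 / expected

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi_square, dof
```

For two nine-cluster classifications the function reports 64 degrees of
freedom, in agreement with the figure quoted in the findings.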

The  three  matrices  are  shown  in  Figures  25,  26,  and  27. 

c.  Findings  and  Interpretations 

For a matrix of this size, the number of degrees of freedom is 64
and the expected value of chi square is 128. Clearly, the observed values of
908, 946, and 1263 are significant beyond any chance expectation.

Phi is an index roughly equivalent to a correlation coefficient, with
minimum zero and an unpredictable maximum in the neighborhood of 1. The
average value observed here, .464, confirms what a visual inspection of
the similarity matrices suggests: that while generally no one cluster
in a classification can be unambiguously assigned to a given cluster
in a second classification, the documents in the first cluster are not
distributed randomly among the clusters in the second classification;
the bulk of the distribution tends to be concentrated in two or three
clusters.
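
The report does not state the formula used for phi. As a check on the
arithmetic, the sketch below assumes the normalization
phi = sqrt(chi square / (N (k - 1))), with N = 600 documents and k = 9
clusters. That formula is an assumption, not something given in the text,
but applied to the three reported chi-square values it yields an average
that rounds to the reported .464.

```python
import math


def phi_index(chi_square, n_documents, n_clusters):
    """Assumed normalization: sqrt(chi^2 / (N * (k - 1)))."""
    return math.sqrt(chi_square / (n_documents * (n_clusters - 1)))


# Chi-square values reported for the three pairwise comparisons.
reported_chi_squares = [908, 946, 1263]
phi_values = [phi_index(x, 600, 9) for x in reported_chi_squares]
average_phi = sum(phi_values) / len(phi_values)
print(round(average_phi, 3))
```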

The three classification structures are only moderately similar, but
the similarity that exists is not the result of chance; it is statistically
very significant.

Figure 25. Matrix Comparing the Categories in Classification A
with B

Nevertheless, based upon the obtained results, it is recommended that
if the ALCAPP automatic classification programs are to be used in a
practical way, then some sort of 'seeding' process must be used for the
initial assignment of documents to clusters. The probable success of
such an approach is suggested by a second experiment described below.

3.  DETERMINATION  OF  THE  SENSITIVITY  OF  THE  ALCAPP  CLUSTERING 
ALGORITHM  BY  THE  ADDITION  OF  DOCUMENTS  TO  A  PREVIOUSLY 
CLASSIFIED  SET 

All  classification  systems  are  organized  on  the  basis  of  an  initial 
collection  of  documents.  Afterwards,  even  though  many  new  documents 
are added to the collection, the original set of categories is expected
to be stable. In manual systems, logical organizing principles are
used  and  stability  is  assured  by  the  ingenuity  of  human  classifiers. 

In an automated system, the basis of classification is the similarity
of the words used in the document or assigned to the word list
characterizing that document. Instead of being able to rely on the
ingenuity of a trained librarian, we must rely on the logic of the
computer program. Granted that there are differences in procedures
as well as advantages and disadvantages on both sides, the fact
remains that any document classification system must be reasonably
stable over time and as new documents are added to the collection. If
there is no stability, it would be impossible to learn to use the
system, and the advantages of classification would be nullified.

a. Purpose

The purpose of this experiment was to determine just how sensitive the
ALCAPP classification algorithm was to the addition of new documents.

A  classification  system  is  stable  when  additions  can  be  made  to  the 
already  established  classification  categories,  and  the  documents  that 
have  been  previously  assigned  to  one  cluster  will  not  be  reassigned 
to  a  different  cluster. 


b.  Method 

Using the cluster-finding algorithm, 500 of the 600 documents in the
collection were classified into 8 clusters. Note that in this experiment,
in contrast with the one discussed in paragraph 2, only eight clusters
remained, rather than nine, although both experiments started with an
initial assignment of ten clusters. The probable reason for this
difference is that in the present instance, 500, not 600, documents
were classified.

The distribution of the 500 documents in the 8 clusters constituted
the initial cluster assignments. To test the stability of this
classification, 10, 25, 50, 75, and 100 documents were added to the original
500 and the program was iterated until the standard termination conditions
were reached, that is, until all the documents had been assigned and no
document changed assignment from one cycle to the next. We compared each
of the resulting classifications to the baseline classification by
counting the number of the original 500 documents that had been
reassigned to a different cluster. In addition, each classification was
compared to every other classification in order to compute stability
as a function of the number of documents added.
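
The stability count described above can be sketched as follows. The
function name and the dictionary representation are assumptions made for
illustration. Because each new run starts from the baseline assignments,
the sketch assumes that cluster labels carry over unchanged from one
classification to the next, so labels can be compared directly.

```python
def count_reassigned(baseline, updated):
    """Count documents whose cluster differs between two classifications.

    Both arguments are hypothetical dicts mapping document id to cluster
    label. Only documents present in both classifications are compared.
    Returns (number changed, proportion changed).
    """
    common = baseline.keys() & updated.keys()
    if not common:
        return 0, 0.0
    changed = sum(1 for doc in common if baseline[doc] != updated[doc])
    return changed, changed / len(common)
```

With 500 baseline documents, a count of 18 reassignments corresponds to a
proportion of 0.036, or about 4 percent.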

c.  Findings  and  Interpretations 

The first results of the experiment are described in two tables.

Figure  28  records  the  number  of  the  original  500  documents  that  changed 
assignments  when  10,  25,  or  more  documents  were  added  to  the  collection. 
Figure  29  is  in  the  form  of  a  diagonal  matrix  and  records,  for  each 
pair  of  conditions,  the  proportion  of  documents  that  had  different 
assignments  in  the  two  classifications  considering  only  those  documents 
that  the  pair  had  in  common. 

A cursory glance at the data in both figures reveals that a significant
difference occurred when 75 documents were added. The number of changing
assignments jumped from 18, when 50 new documents were added, to 235,
when 75 documents were added, and the percentage change went from about
4 percent to approximately 45 percent.

Two possibilities could account for this dramatic change: either there
is a major difference in the content of the last 25 documents added, or
the algorithm itself becomes less stable when more than 10 percent of
the original data base is added at one time. Since the documents were
selected at random from a homogeneous data base, it is unlikely that
there is a real difference in the content of the documents. This being
the case, the workings of the algorithm itself need to be studied and
evaluated.

An additional experiment was designed to compare the stability of the
classification system when 75 new documents (more than 10 percent)
were added to the 500, as compared with the addition of the last 25
documents to a starting classification containing 550 documents. The
first set of conditions was simply a restatement of the previously
followed procedure in which 235 documents changed cluster assignments.

In the second technique, the starting classification contains 550
documents and not 500. To these 550 documents, 25 were added (less than
10 percent) and the entire group of 575 was reclassified. These two
procedures, each containing the same number of documents in the final
classification, could then be compared.

The results of this experiment are contained in the two matrices
illustrated in Figures 30 and 31. The question being investigated was:
How many of the original 500 documents change cluster assignments when
75 new documents are added to an initial classification structure
containing 500 documents, as compared with adding the same 75 documents
in two stages, first 50 and then 25? (Fifty was chosen as the starting
point since the set of 550 resulted in the last stable configuration.)

Note that the number of documents changing category assignments in
Figure 30 is large, while in Figure 31 relatively few documents change
cluster assignments. This is interpreted to mean that, using the ALCAPP
algorithms, the number of new documents added to an existing
classification structure at any one time should be no more than ten
percent of the documents already in that structure. Under these
conditions, the basic classification structure remains reasonably stable.
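
One way to apply this guideline is to add new documents in stages, each
stage no larger than ten percent of the collection as it then stands. The
sketch below is only an illustration of that bookkeeping; the function
name and the `max_percent` parameter are assumptions, and the
reclassification that would follow each stage is not shown.

```python
def staged_batches(collection_size, new_documents, max_percent=10):
    """Split new documents into stages of at most max_percent of the
    current collection size (hypothetical helper, integer arithmetic)."""
    remaining = list(new_documents)
    size = collection_size
    batches = []
    while remaining:
        # Never allow a zero-sized stage, even for very small collections.
        limit = max(1, size * max_percent // 100)
        batch, remaining = remaining[:limit], remaining[limit:]
        batches.append(batch)
        # The collection grows before the next stage is sized.
        size += len(batch)
    return batches
```

For 75 new documents and a starting collection of 500, the stages contain
50 and then 25 documents, which is the two-stage procedure tested in the
additional experiment.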

Figure 30. Matrix Showing the Number of Documents
Changing Categories when the Number of
Additional Documents Exceeds Ten Percent
(Matrix not reproduced. Axis labels: Cluster Distribution of the
Experimental Set of 500 Documents, Total Set = 500; Cluster Distribution
of the Experimental Set of 500 Documents, Total Set = 550.)

4. DESCRIPTION OF THE CLUSTERS

As  an  adjunct  to  these  experiments  on  stability,  one  of  the  members 
of  the  project  staff  examined  the  initial  classification  of  the  500 
documents  to  see  if  the  clusters  seemed  "reasonable,"  i.e.,  whether 
there  was  a  unifying  theme  shared  by  the  documents  classified  in  the  same 
category.  It  was  recognized  at  the  outset  that  this  process  is  highly 
subjective,  and  it  was  undertaken  only  to  make  some  estimate  of  the 
reasonableness  of  the  classification. 

The  results  of  this  perusal  are  both  satisfying  and  disappointing; 
the  categories  make  sense  but  they  are  not  cohesive.  Most  of  the 
clusters  contained  a  "core"  of  documents  that  were  indeed  highly  related 
and could sensibly be classified together. On the other hand, two
effects  were  noted  that  are  not  reasonable.  First,  many  of  the  documents 
in  a  cluster,  say  10  to  20  percent,  seem  misplaced  in  the  sense  that  they 
would  appear  to  fit  more  appropriately  in  another  of  the  8  clusters. 
Second,  certain  topics  that  seem  as  though  they  ought  to  form  distinct 
clusters do not, and are scattered through all the clusters. Examples
of this latter case are:

(a)  documents  related  to  "artificial  intelligence"; 

(b)  documents  related  to  "writing  style"; 

(c)  documents  related  to  "machine  translation." 

One possible explanation for this effect is that there are relatively
few documents in these categories (20 to 25). Nonetheless, one could
hope that all documents pertaining to a given topic might have been
assigned to the same cluster.

A  description  of  the  eight  clusters  follows,  but  in  these  interpretations 
only  the  "core"  subject  area  or  areas  are  described,  with  some  indication 
as  to  their  purity: 

Cluster 1: 42 documents

Automated, computer-oriented information retrieval systems. Fairly
cohesive cluster. Oddly enough, a fair number of documents pertaining
to medical information retrieval systems, which might have fitted better in
Cluster 2, wound up here. The choice is fairly arbitrary, in that the
medical systems described are machine-oriented.

Cluster 2: 94 documents

Descriptions, or descriptions and evaluations, of working IR systems.
("Current awareness," "Documentation Dissemination," etc.) Methods of
evaluation of IR systems.

Cluster 3: 57 documents

Library automation: shelf lists, document control, accessions, etc.
Library cataloging operations, manual or machine.
Impurities: "Automatic text processing"; documents relating to "costs."

Cluster 4: 67 documents

Technical communication;
Communication networks;
A reasonably homogeneous cluster.

Cluster 5: 100 documents

A fairly mixed group containing documents on reproduction methods,
publication, hardware descriptions, and chemical IR systems.

Cluster 6: 93 documents

Educational libraries (i.e., various school libraries);
The education of librarians (library school curricula, etc.);
Professional aspects of librarians;
Specialized Information Centers (medical, agricultural, etc.).

Cluster 7: 30 documents

Document representation: thesauri, indexing, classification, etc.;
A fairly cohesive group of documents.

Cluster 8: 17 documents

No easily discernible pattern, but generally concerned with
representation methods: abstracting and indexing.


5.  SUMMARY  OF  RESULTS  AND  CONCLUSIONS 


A series of experiments was conducted to measure the stability of
automatically  derived  document  groupings.  The  first  experiment  was 
designed  to  determine  the  sensitivity  of  the  clustering  algorithm  to 
changes  in  the  initial  assignment  of  documents  made  at  the  start  of 
the  program.  Three  classification  structures  were  derived  using 
different  starting  assignments.  These  were  compared  and  found  to  be 
only  moderately  similar,  which  indicates  that  the  algorithm  is  sensitive 
to  changes  in  initial  clustering  assignments.  It  is  recommended  that 
if  the  ALCAPP  automatic  classification  programs  are  to  be  used  in  a 
practical  situation,  the  documents  selected  for  the  initial  cluster 
assignments  be  selected  with  a  view  toward  achieving  a  reasonable 
cluster  separation. 

The second experiment was aimed at determining the stability of the
classification structure as new documents were added to the collection.
When more than ten percent of new documents were added at one time, the
algorithm was not stable. It is therefore recommended that in a practical
situation no more than ten percent of new documents be added to the
existing classification structure at any one time.

Finally, the documents in the clusters were examined to determine whether
the clusters appeared to be cohesive and reasonable from a content analysis
point of view. The results show that, while the automatically created
clusters are statistically reliable and definitely not random,
the grouping of documents by content is imperfect. If the automated
classification structure is to be used for manual search, as well as
in a computer retrieval system, we recommend that the clustering
algorithm be used to provide the initial rough grouping of the documents
that can then be fairly easily modified and made more rational by a
trained librarian.

SECTION VI

MEASURING THE UTILITY OF AUTOMATED
DOCUMENT CLASSIFICATION HIERARCHIES

A  classification  system  is  designed  to  help  people  locate  documents 
that  are  relevant  to  their  interests,  and  to  do  so  efficiently.  If 
no classification system is available, one has to make a serial search
through the entire collection looking for a given subject, a given
title, or a given author. Such a comprehensive search is time-consuming,
and the usefulness of the classification system is shown in
its improvement in the speed and/or accuracy of the retrieval process.

The fact that a classification system can reduce the time required to
search a data base is inherent in the logic built into the search
strategy. By dividing the total document collection into sections,
only those categories relevant to the request are searched and all other
portions of the data base are ignored; thus, search time is reduced.

While it is obvious that one can lessen search time by searching
fewer documents, if one is not searching the relevant portions of
the data base, the classification system serves no useful purpose.

On the small data base being used for these experiments it was impossible
to make a significant saving in search time, or even to demonstrate how
the automatic classification programs divided the collection into clusters
that make search and retrieval more efficient. The value of the
clustering technique must be tested and demonstrated indirectly. This
can be done by comparing people's judgments of the content of the original
document when only the index terms that characterize that document
are  known  and  when  both  the  index  terms  and  the  classification  category 
to  which  the  document  belongs  are  known.  If  it  can  be  demonstrated 
that  the  accuracy  of  judging  document  content  improves  when  documents 
are  classified  as  well  as  indexed,  then  one  can  infer  that  classification 
is  an  aid  in  judging  the  relevancy  of  a  document  surrogate  and,  thus, 
an  aid  in  document  retrieval. 

1.  COMPARING  DOCUMENT  REPRESENTATIONS1 

There  is  a  need  for  better  answers  to  one  question  of  long-standing 
interest  to  persons  trying  to  improve  document  searching  systems: 

Given a proposed revision in a document-representation technique, how
can it be determined whether the proposed change will effect an
improvement? A very important part of the answer to this question
depends on whether the proposed revisions will actually result in more
adequate representations of the documents and the information
requirements statements input to the system. Thus, the empirical
methods used to test the adequacy of representations are important
in guiding the evolution of document-searching methods, and this means
that such testing methods ought to be scrutinized regarding their
strengths and weaknesses and their potentialities for yielding additional
insights into the processes of document representations and searching.

1 The investigator wishes to acknowledge the contribution of Richard Weis
to these utility experiments and, particularly, to the discussion that
follows on document representations, modeling and scaling, as well as
his help in the statistical analysis of the results.

Basic to the idea of adequacy of representation is the notion of the
representation process itself. The concept of representation is
deceptive in its apparent simplicity, for there is no widely accepted
consensus as to the precisely defined limits of this process. One
issue is the purpose of the representation. The purpose may be, for
example, to inform the user about the contents of a document; to
indicate what the contents are; to provide the basis of an accurate,
sensitive search for stored documents; to allow the user of a
representation to make the same interpretations that he would make if
given the full document; or to allow the user of representations
to make the same distinctions that he would make between the full
documents.

It is important to note that these purposes are not the same and that
representations made with one of these purposes in mind may not necessarily
fit a different purpose. For this study, we chose to look at
representations in terms of their ability to allow the user to make the
same distinctions between representations that he would make between the
full documents.

The basic purpose of the utility study was to evaluate how useful a
selected set of document representations would be as an aid in the
retrieval function. The representations used were: a set of index
terms produced by the computer, and the computer-produced index terms
coupled with each of two types of classification produced by the
Hierarchical Clustering Program. The details of the production of these
representations are discussed elsewhere in this report (Section III).

The analytic techniques used in part of the study are described in
Paragraph 3, Data Analysis. However, the analysis and the conceptual
model are so interrelated that it is necessary to discuss both. We
shall first sketch out a psychological model.

a. Psychological Model of Multiattributed Objects

In the discussion to follow, a stimulus object refers to a thing in
the real world; its attributes are the things that describe the object.

A  stimulus,  on  the  other  hand,  refers  to  a  construct,  roughly  the 
set  of  values  of  the  attributes  of  a  stimulus  object. 

Every  stimulus  object  has  an  uncountable  number  of  attributes;  however, 
in  making  distinctions  between  objects,  only  a  relatively  few  are 
involved.  These  are  determined  by  the  setting  or  context  in  which 
the  comparisons  are  being  made.  Clearly,  they  can  include  only 
attributes  on  which  the  objects  differ.  Others,  although  the  objects 
differ  on  them,  will  have  no  relevance  to  the  comparisons.  Further, 
some attributes may be more important than others; that is, they may
loom  larger  in  the  comparison.  For  example,  in  the  document  area,  grade 
of  paper  may  be  irrelevant  and  writing  style,  though  relevant,  may  be 
secondary  to  other  considerations. 

It  is  an  assumption  of  the  model  that  the  objects  are  measurable  with 
respect  to  their  attributes;  in  other  words,  it  is  possible  to  make 
numerical  assignments  to  the  objects  that  reflect  the  'amount'  of  the 
attribute  they  have. 

Since the objects may vary independently on each of their attributes,
a spatial model with the attributes as orthogonal axes, forming a
basis, is a natural extension.

In such a model, the stimuli are points in the space and the projections
of the points on the basis vectors are the values of the stimuli on the
related attributes.

The similarity of stimuli is, then, a function of the distance between
them. The form of the distance function is not specified completely,
and any distance function that satisfies the Minkowski inequality is
a possible candidate. Most of the early work with this model assumed
a Euclidean distance function. In this work, the Euclidean model was
used as a first approximation. For a detailed discussion of the forms
of the metric and the psychological interpretation of metrics other than
Euclidean see Shepard's article in the Journal of Mathematical Psychology
(1965).
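
As an illustration of the spatial model, the sketch below treats each
stimulus as a list of attribute values and computes the distance between
two stimuli with the Minkowski family of metrics. The function name and
the exponent parameter `r` are assumptions made for the example; r = 2
gives the Euclidean distance used here as a first approximation, and
r = 1 gives the "city-block" alternative.

```python
def minkowski_distance(a, b, r=2.0):
    """Distance between two stimuli given as equal-length lists of
    attribute values. r must be at least 1 for the result to be a metric."""
    if r < 1:
        raise ValueError("r must be at least 1")
    return sum(abs(x - y) ** r for x, y in zip(a, b)) ** (1.0 / r)


# Two hypothetical stimuli scaled on two attributes.
print(minkowski_distance([0, 0], [3, 4]))        # Euclidean
print(minkowski_distance([0, 0], [3, 4], r=1))   # city-block
```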

b.  Multidimensional  Scaling 

If one has all the stimulus objects' scale values on all the relevant
dimensions, it is a simple computational task to determine
the distance between the points. However, except in a few perceptual
domains that have been extensively studied, for example, color vision
(Helm, 1959, and Helm and Tucker, 1962), this information is not
generally available. Yet it is possible for judges to scale objects
in terms of their similarity without reference to particular attributes.
It should then be possible to recover the underlying dimensions.

There are two approaches to the analysis of similarities; the older
derives from factor analysis, the newer from regression analysis.

The  factor  analysis  approach  depends  on  the  ability  to  obtain  from 
a  set  of  similarity  judgments  a  matrix  that  can  be  factored. 

The  newer  methods  are  all  related  to  an  algorithm  devised  by 
Shepard  (1962a,  1962b).  The  rationale  for  the  Shepard  procedure 
comes  from  an  interesting  proof,  by  Abelson  and  Tukey,  that  nonmetric 
aspects  of  the  data  can  very  closely  determine  a  metric  function. 

Abelson and Tukey (1959, 1963) show that, if a set of data can be
fitted to a so-called ordered metric, that is, roughly, a ranked
sequence in which also the first differences are ranked, there is a
strategy that will fit a metric to the data that will correlate very
highly with the 'true' metric. To paraphrase their finding, the
constraints implied by the ranking of the first differences are such
that the possible positions of a given point that will preserve the
ordering are all very close, so that only a very limited class of
distance functions can satisfy the inequalities implied by the
ordered metric and that any two of these distance functions will be
very highly correlated.

The Shepard method and related methods assume that the similarity
judgments are at least monotonically related to the distance function.
Consider a regression problem where one variable is the rank
of the distance between each pair of points and the other is the
distance; the problem is to find a distance function that minimizes
the mean square error in the distance and preserves the order of the
distances.

The Shepard algorithm starts with an arbitrary configuration of the
points in a multidimensional space. Starting from one point it
'looks at' every other point and decides if the distance between
them is too small or too large. It attaches a vector to each point
to correct the discrepancy and finally takes the resultant of all the
vectors attached to each point as the direction in which to move the
point. It takes, as the distance to move the points, a fraction of the
length of the resultant, so that the configuration will approach a fit
slowly. It then moves all the points; such a move is called a jiggle.
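
A single jiggle can be sketched as below. This is a simplified
illustration in the spirit of the description, not Shepard's or
Torgerson's program. In particular, the way a "target" distance is chosen
for each pair (reusing the current distances, reassigned so that their
rank order matches the rank order of the judged dissimilarities) is an
assumption made for the example, as are the function name, the input
format, and the `step` fraction. Reduction of dimensionality is not shown.

```python
import math
from itertools import combinations


def jiggle(points, dissim, step=0.2):
    """Move every point once. `points` is a list of coordinate lists;
    `dissim` maps each pair (i, j), i < j, to a judged dissimilarity."""
    pairs = list(combinations(range(len(points)), 2))
    dist = {p: math.dist(points[p[0]], points[p[1]]) for p in pairs}

    # Assumed target: the k-th smallest current distance is assigned to
    # the pair with the k-th smallest judged dissimilarity.
    by_dissim = sorted(pairs, key=lambda p: dissim[p])
    target = dict(zip(by_dissim, sorted(dist.values())))

    dims = len(points[0])
    resultant = [[0.0] * dims for _ in points]
    for i, j in pairs:
        d = dist[(i, j)]
        if d == 0.0:
            continue  # direction undefined for coincident points
        error = target[(i, j)] - d  # positive: too close, push apart
        for k in range(dims):
            unit = (points[j][k] - points[i][k]) / d
            resultant[i][k] -= error * unit
            resultant[j][k] += error * unit

    # Move each point a fraction of the way along its resultant vector.
    return [[x + step * r for x, r in zip(pt, res)]
            for pt, res in zip(points, resultant)]
```

When the current distances are already in the same rank order as the
dissimilarities, every correction is zero and the configuration is left
unchanged; otherwise pairs judged more dissimilar are pushed apart and
pairs judged more similar are drawn together.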

This is the spirit of the three best known methods of multidimensional
scaling: Shepard (op. cit.), Kruskal (1964a, 1964b), and Torgerson
and Meuser (1962). They differ in detail, especially as to the method
of reducing the dimensionality of the configuration. The Torgerson
program, which we used, performs the jiggling in successively lower
dimensional spaces. The lowest dimensional configuration that meets
the goodness-of-fit criterion of the investigator is the one used.
Torgerson gives a guideline for goodness of fit but the ultimate
criterion is replicability. The Torgerson program also performs the
factor analytic procedure at least once to obtain the initial
configuration.

The multidimensional scaling model essentially refers to a single
individual's perceptual space. To improve the stability and reliability
of such an analysis it is common practice to combine the judgments of
several individuals into a consensus judgment. This procedure is not
without risk. To be valid, all the individuals' conceptual spaces
must be very similar. For example, in the document case, one person
may make all his judgments of similarity on the basis of contents of
the document while another individual may make his judgments on the
basis of writing style. To combine such judgments would possibly
violate the assumptions of the model.

Therefore, an analysis of the judgments is performed, a so-called
points-of-view analysis (Tucker and Messick, 1963). This analysis
is essentially a Q-type factor analysis performed on the cross
products of the judgments rather than on correlations. The factors
isolated by this procedure roughly correspond to possible bases for
making the judgments. A single individual may make his judgments
from some combination of bases or points of view. Therefore, each
judge receives a 'score' on each factor that indicates how much of
that point of view entered into his judgments. Judgment matrices for
various 'ideal' or 'hypothetical' judges can be formed by taking linear
combinations of the original judgments, using as weights specific
combinations of scores on the various factors. In this fashion, it
is possible to construct judgment matrices for hypothetical judges
that are not represented in the sample. In our work we did not make
much use of this facility, simply because we were not particularly
interested in "ideal" judgments. We were interested in finding the
judgment matrix that best represented our sample of judges, what we
might call a consensus judgment. Our concern was not to find a
large number of points of view, but to assure ourselves that our data
were not contaminated by improperly combining variant points of view.

To this end, we inspected the cross plots of persons' factor scores,
looking for clusters that we had to treat as different points of view.
Also we had to judge, from the size of the largest root of the cross
product matrix, if we could safely assume that a single point of view would
accurately represent our sample of judges.
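
The check on the largest root can be illustrated as follows. This is a
sketch under stated assumptions, not the Tucker and Messick procedure
itself: the function name is hypothetical, each judge is represented by a
list of nonnegative similarity ratings over the same document pairs, and
the largest root is found by simple power iteration. The share of the
trace taken by the largest root is used here as a rough indicator; the
report does not say what threshold was applied.

```python
import math


def largest_root_share(judgments, iterations=200):
    """Return the largest root of the judges' cross-product matrix as a
    fraction of the sum of all roots (the trace).

    `judgments` holds one list of nonnegative ratings per judge."""
    n = len(judgments)
    # Cross products between every pair of judges.
    cross = [[sum(a * b for a, b in zip(judgments[i], judgments[j]))
              for j in range(n)] for i in range(n)]
    trace = sum(cross[i][i] for i in range(n))
    if trace == 0:
        return 0.0

    # Power iteration: repeated multiplication converges on the
    # direction of the largest root for nonnegative ratings.
    v = [1.0 / math.sqrt(n)] * n
    root = 0.0
    for _ in range(iterations):
        w = [sum(cross[i][j] * v[j] for j in range(n)) for i in range(n)]
        root = math.sqrt(sum(x * x for x in w))
        if root == 0.0:
            return 0.0
        v = [x / root for x in w]
    return root / trace
```

A share near 1 indicates that the judges' ratings are close to
proportional to one another, which is the situation in which a single
consensus point of view is least likely to mislead.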

2.  PURPOSE  AND  METHODOLOGY  OF  THE  UTILITY  STUDY 

In the utility study, the purpose was to evaluate the effects of automatic
index term assignment and automatic classification on document
representation. Recall that we defined the 'goodness' of a document
representation (surrogate) in terms of how well the representations
allow a user to make the same judgments about the documents that he
would make given the full text. The judging procedure is an arduous
task and experience dictates that subjects can perform the task
on at most 12 or 13 full-text documents (this takes about three
hours); for representations, the judging process goes considerably
faster, taking about one to one and a half hours.

Three sets of documents were selected from the 52 documents used in
the previous studies. The first of these was selected so that the
documents were rather uniformly distributed among the clusters derived
by the WD-2 method; the second set was similarly constructed, using
the WD-3 clusters; the third set was chosen on the basis that they had
all been assigned to the same category in Documentation Abstracts and
yet fell into different clusters on both the WD-2 and WD-3
classifications. It was possible to select these three sets so that each had
a subset of six documents in common, thus providing a common core for
comparison.

The subjects were divided into four groups, and each group performed a
judgment task on each set of documents or corresponding surrogates.

For ease in labeling, a code was used to identify the type of
representation as follows:

A = the full text of the document

B = lists of machine-derived index terms only

C = lists of human-prepared index terms only

D = lists of machine-derived index terms plus WD-2
classification structure

E = lists of machine-derived index terms plus WD-3
classification structure.

Document sets were numbered as follows:

1 = for use with the WD-2 classification structure

2 = for use with the Documentation Abstracts classification structure

3 = for use with the WD-3 classification structure.

Hence,  a  treatment  code  of  A3  indicated  a  judgment  set  containing 
the  full  text  of  those  documents  selected  for  use  with  the  WD-3 
classification  structure,  and  so  forth.  The  experimental  design  is 
shown  in  Figure  32. 

The design was severely limited by the number of subjects available.
For reliability, it was desirable to have 10 judges in each set;
however, a single judge could not be used on the same basic material
more than once without danger of carry-over effects. This limited
us to the 12 judgment sets shown. The design was guided by the
necessity of making certain comparisons between surrogate types and
the full text and the fact that some comparisons between different
surrogate types were of only limited utility to the study. Each
surrogate type was compared against at least two full text sets.

Notice in Figure 32 that full-text set No. 1 was judged twice, as A1
and A1', to balance out the design and to get an estimate of the
reliability of the judgments of the full text, etc.

                          Session
Subject Group       I       II      III        N

      1             D1      A2      B3        10

      2             C2      E3      A1        10

      3             A1'     C3      D2         8*

      4             E2      A3      B1         9

*One subject made D2 judgments and did not complete
the task.

Figure 32. A Balanced Design of the Experimental
Conditions
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a.  Subjects 

The experimental subjects were UCLA students in the Graduate School of
Library Service. Personal information supplied by these subjects
indicated that the majority were in their first semester of graduate
study and most had had little or no experience with classification.

The subjects were not randomly selected; they constituted the entire
group who volunteered for the experiment except for the one subject
who did not complete all three tasks. Four subjects, who could not
attend the regular sessions, were given the tasks at SDC at a later
time. Subjects were randomly assigned to subject groups, and were
compensated for their time at a rate of $2.50 per hour.


b.  Instructions 

The subjects were given a one-hour instruction session during which
the general instructions were read verbatim (Figure 33). Additional
instructions are attached to the rating form (Figure 34). These
instructions were substantially the same for each rating set except for
minor wording for each type of material to be rated. The remainder of
the instruction session was spent on answering questions about the task
and filling out a personal information form (Figure 35).

In the testing sessions, the subjects were provided with a dictionary of
acronyms and obscure words (Figure 36) that occurred in the index lists
[... illegible ...] subjects
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UT  Inti. 


General  Instructions  and  Orientation 


Those of us who are conducting this study are employed as researchers by the
System Development Corporation of Santa Monica, which is a non-profit corporation
specializing in the design and development of large man-machine data processing
systems.

One such class of systems we are interested in is library systems, and the
present study you are participating in is concerned with one aspect of the
library problem. I will now try to give you a brief sketch of the nature of
this problem.

We are all aware of the tremendous increase in the number of scientific and
technical publications per year. This increase, sometimes called the information
explosion, is responsible for a correspondingly large increase in the work load
of library workers. Particularly difficult is the indexing, classification, and
document retrieval tasks. Our area of research deals with the area of machine-
aided indexing and classification. We are trying to reduce the work involved
in indexing and classifying documents by introducing machine-aided methods of
both indexing and classification. However, such methods will be of little use
if the results of a machine-aided indexing and classification system are of no
use to the user. Any machine-aided system must produce an output that is as
easily interpretable as the currently available systems.

One of the tools we are using in our research is a rating procedure known as
paired comparisons. In this procedure you will be asked to rate (scale) your
own personal judgment of the content of several technical articles and some
representations (surrogates) of these articles. The surrogates consist of lists
of index terms either man or machine produced (you won't know which), or such
lists with added classification data. All in all there will be three judging
tasks: one involving the full text of 12 articles, one involving index terms
of 12 different articles, and the final task involving index terms and
classifications of another 12 articles. You will be given the three tasks in different
orders, that is, some of you will judge the articles first, and some the surrogates
first, etc.

For  the  full  articles,  you  will  be  given  one  week  to  read  the  articles  at  your 
leisure.  For  the  surrogates,  you  will  be  given  about  one  hour  to  familiarize 
yourselves  with  them. 

The rating procedure is quite simple. On the second page of your rating form
there are three columns of pairs of numbers. These are the numbers of the
articles. You are to take the pairs of articles in the order that they appear
in the columns and look them over to refresh your minds as to their content.

Then you will make a numerical estimate of their apparent similarity using the
scale on page one of the rating booklet. Do not worry about the exact meaning
of the scale items; they are placed there as an aid to you in using the scale,
but it is understood that each of you will adopt his own personal interpretation
of the scale. All we ask of you is that you attempt to use the scale in as
consistent a manner as you can. If you become tired, please feel free to take
a break and leave the room. You will be given ample time to complete the task.


Figure  33 


General  Instructions  and 
Orientation 


However, we ask that you do not discuss any part of this task with others until
the experiment is completed.

Remember this scaling task reflects your personal perceptions of the similarity
of the documents; therefore, there is no right or wrong answer. You will not be
scored in any such sense. Your judgments will be used in a subsequent mathematical
analysis of the various indexing and classification systems.

If you desire, a report of the results of the analysis will be sent to you upon
the completion of the study. If you wish such a report please indicate so in
the place provided on your personal history sheet.

Also, although there is nothing in this task that reflects in any way upon you
as individuals or as students, all responses will be kept anonymous according
to the rules of the American Psychological Association. To aid this, you have
all been assigned subject identification numbers; please place these numbers and
only these numbers on each sheet of paper given you, including all pages of the
rating booklets. Accuracy in the use of these identification numbers is
extremely important, as is your care and attention to the rating task. Please
check all your work carefully.

Are  there  any  questions? 


Figure  33-- Concluded 


Subject  Identification  Number 


UT  1-e 


Document Similarity Rating Form


In this task you are asked to judge the similarity of the contents of pairs of
documents. The document numbers are presented in pairs followed by a blank
space. For each pair you are to make the best possible estimate of their
similarity from the given information and indicate this judgment by selecting
the statement below that comes closest to describing your judgment and placing
its number in the blank opposite the pair being judged.


"To me, the subject contents of the two documents would most likely:

1. be almost completely similar."

2. be highly similar."

3. be quite similar."

4. be slightly more similar than different."

5. be about equally similar and different."

6. be slightly more different than similar."

7. be quite different."

8. be highly different."

9. be almost completely different."


About making judgments:

1. There is absolutely no basis in this experiment for considering any judgment
you might wish to make as more or less "right" or "wrong." We desire your
immediate, independent judgment, without consulting aids such as authority
lists and without unduly extended analysis of the situation.

2. All document numbers occur again and again in different combinations in this
exhaustive method of paired comparisons. The judgment task can become quite
onerous, but we know of no other way to extract the needed detail of data.
Accordingly, we depend on you to pace yourself as you see fit. If you notice
your attention wandering or an inability to focus any longer on the task,
please take a break and wait until you are able to return with fresh
concentration.

3. Be sure to place your subject identification number on page 2 of the rating
booklet in the upper left-hand corner. Place the list description number
(A1, A2 ... E3) in the upper right-hand corner. This number is in the upper
right-hand corner of the envelopes containing your materials and on each
index term page (the list numbers are not on the full text document
reproductions, only on the envelope containing them).
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Document  Similarity  Rating  Form 
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Subject  Identification  Number  -2- 

(List Designation) ____

[Three columns of 26 document-number pairs, 78 pairs in all, each pair
followed by a blank for the rating. The individual pairs are not reliably
legible in this copy.]


Figure 34--Concluded

1. Name ________ Age ____ Sex ____

2. Home Address ________ Phone ________

3. University ________

4. Status (grad., faculty, etc.) ________

5. Brief Description of Education--please note the major areas you have studied
both as an undergraduate and graduate student, and your degree objective.


6. Work Experience--please list your major jobs, not part time or summer work.
If in the library field, list type of duties and length of time.


Figure 35. Personal Information Form


AfMIS                    Medical IR system
AHU film                 A type of microfilm sheet film
API                      American Petroleum Institute
ASM                      American Society for Metals
ASTIA, ASTIA thesaurus   Armed Services Technical Information Agency
                         (predecessor to DDC)
CDCR                     Center for Documentation and Communications Research
COBOL                    A programming language
COMDEX                   Concept indexing
COSATI                   Committee on Scientific and Technical Information
DDC, DDC thesauri        Defense Documentation Center
DOD                      Department of Defense
EJC, EJC thesaurus       Engineers' Joint Council
EL-Nikkor                A camera lens
FORTRAN                  Programming language
INFOL                    An information language and index scheme
KWIC                     Key word in context
KWOC                     Key word out of context
LEX, project LEX         DoD project to develop common indexing vocabulary
Lodestar                 Microfilm cartridge reader-printer
MEDLARS                  Medical information retrieval system
MESH                     Medical subject heading index
MHRST                    Medical and health related sciences thesauri
microfiche               Sheet microfilm
NMA                      National Manufacturers' Association
NUCMC                    National Union Catalog of Manuscript Collections
PAS                      Personalized alerting service
RADC                     Rome Air Development Center
RHD                      Random House Dictionary
SiNTRAN                  An indexing, abstracting and retrieval program
TEXT-90                  An automated document preparation program
T2CTC0N                  A program for converting text into a better
                         form for computers


Figure  36.  Dictionary  of  Acronyms 
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had  questions  about  the  index  terms  and  the  interpretation  of  the 
tree/graph  classifications.  Questions  of  this  nature  were  answered  in 
general  terms  only,  to  avoid  unduly  influencing  the  subjects'  rating. 

We  feel  that  the  manifest  uncertainty  concerning  interpretation 
of  the  classification  information  had  a  considerable  effect  on  the 
result.  However,  it  is  impossible  to  estimate  the  size  of  this 
effect. 

c.  Rating  Procedure 

The rating scale was presented during the instruction period and was
reproduced on each of the rating forms (Figure 34). During the testing
session, the subjects rated each pair of documents or surrogates on a
9-point scale of dissimilarity, one pair at a time. This resulted in
$(n^2 - n)/2$, or $(12^2 - 12)/2 = 66$, judgments. The first 12 judgments
were arranged so that each article appeared at least one time; these 12
judgments were repeated at the end, thus bringing the total number of
judgments made by each judge to 66 + 12, or 78. The repetition allowed the
judges a 'warm-up' and some check on rater reliability. However, many
subjects noted the repeated items and were then instructed to make the
judgments again without referring to their earlier efforts. The first
12 judgments are used only for reliability checks and a points-of-view
analysis. The fact that subjects noticed the repeats is of little
consequence, since the major reason for the repeats was to allow for
some warm-up to take place in each session and to assure that all 12
documents were referred to at least once before the major judging task.
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For  surrogate  material  the  subjects  were  given  the  rating  materials 
at the start of the judging session. For full-text material they were
given the material one week before the task, with instructions to spend
about six hours reading and familiarizing themselves with the material.
They were allowed to make any notes they wished on the materials and the
majority of the group did so.

On the whole, cooperation of the subjects was excellent and they
appeared to have taken the task quite seriously and devoted good effort
to the reading and judging of all materials.
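
The judgment counts given in the rating procedure can be checked with a
short sketch. The function name and the particular choice of warm-up pairs
(a simple chain 1-2, 2-3, ..., 12-1) are assumptions made for illustration
only; the report does not say how the 12 warm-up pairs were chosen, only
that every article appears in at least one of them.

```python
from itertools import combinations


def build_rating_schedule(n_docs=12):
    """Return a hypothetical ordering of paired comparisons (n_docs >= 3).

    The schedule holds every unordered pair once, with a block of warm-up
    pairs placed first and then repeated at the end.
    """
    all_pairs = list(combinations(range(1, n_docs + 1), 2))  # (n^2 - n)/2 pairs
    # Assumed warm-up block: a chain that touches every document.
    warmup = [tuple(sorted((d, d % n_docs + 1))) for d in range(1, n_docs + 1)]
    rest = [pair for pair in all_pairs if pair not in warmup]
    return warmup + rest + warmup


schedule = build_rating_schedule()
print(len(set(schedule)), len(schedule))  # distinct pairs, total judgments
```

With 12 documents the sketch yields 66 distinct pairs and 78 judgments in
all, matching the totals stated above.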

3.  DATA  ANALYSIS 

For the utility experiments, two separate but related data analyses
were performed. The first was the points-of-view analysis, designed to
ensure that no rater had a deviant approach to the rating task and that
all judgments in a set could be combined. The second, or multidimensional
scaling analysis, was designed to determine the number of aspects in the
document or surrogate, such as the subject matter, difficulty level,
writing style, etc., that contribute to the similarity judgments.

a.  Points-of-View  Analysis 

The points-of-view analysis was adapted from a FORTRAN II coding
originally done at the University of Southern California. This analysis
follows exactly the procedures outlined in Tucker and Messick (op. cit.),
the SDC modifications being restricted to a recoding in FORTRAN IV
(initially for the 7094 IBSYS operation and then for OS/360-65
operation).

In this study, a points-of-view analysis was performed on each judgment
set. There were twelve such sets (see Figure 32), or unique combinations
of judges and stimuli. Each set provided an N x 78 matrix in which
N was the number of judges and 78 was the number of judgments each
rater made.

In all cases, that is, in all twelve sets of data, the analysis
revealed the presence of only one dominant point of view. The largest
root of the matrix accounted for over 90 percent of the trace. As a
result, it was possible to combine the individual raters' judgments
and to form a consensus judgment for each judgment set. The set
consensus was computed by making a linear combination of judgments from
each judge, using as weights the judge's score on the first factor of
the points-of-view analysis. Thus, each judge's contribution to the
consensus was in proportion to his 'distance' from the origin of the
'persons' space. This procedure, except for a normalizing factor, is
equivalent to taking a weighted mean of the judgments as the consensus.
In this study, the points-of-view procedure was used largely as a matter
of convenience; the programs were already set up to work that way from
previous research efforts, and the method was known to be superior to
taking a simple average as the consensus, although an inspection of
the factor scores indicates that the weighted average is very
little different from a simple average.
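
As an illustration of the consensus computation described above, the
following sketch finds the largest root of the judges' cross-product matrix,
reports its share of the trace, and forms a weighted mean of the judgments.
It is not the Tucker and Messick program: the function names are
hypothetical, power iteration stands in for whatever root-extraction method
the original FORTRAN code used, and the elements of the first characteristic
vector are assumed to serve as the judges' first-factor scores.

```python
def cross_products(ratings):
    """Judges-by-judges matrix of sums of cross products of the ratings."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in ratings]
            for r1 in ratings]


def largest_root(matrix, iterations=500):
    """Power iteration; assumes all ratings are positive (1 to 9 scale)."""
    n = len(matrix)
    vector = [1.0] * n
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vector[j] for j in range(n))
                   for i in range(n)]
        norm = sum(x * x for x in product) ** 0.5
        vector = [x / norm for x in product]
    product = [sum(matrix[i][j] * vector[j] for j in range(n))
               for i in range(n)]
    root = sum(v * p for v, p in zip(vector, product))
    return root, vector


def consensus_judgments(ratings):
    """Return (share of trace, judge weights, weighted-mean consensus)."""
    matrix = cross_products(ratings)
    root, weights = largest_root(matrix)
    trace = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(weights)  # the normalizing factor
    consensus = [sum(w * row[k] for w, row in zip(weights, ratings)) / total
                 for k in range(len(ratings[0]))]
    return root / trace, weights, consensus


# Three made-up judges rating four pairs (illustrative data only).
share, weights, consensus = consensus_judgments(
    [[1, 5, 9, 3], [2, 5, 8, 3], [1, 6, 9, 2]])
print(round(share, 3), [round(c, 2) for c in consensus])
```

When the judges agree closely, the weights are nearly equal and the
consensus is close to a simple average, which is the behavior the text
reports for the factor scores in this study.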

At the completion of the points-of-view analysis, each of the 12
experimental treatments had been reduced to a single vector of 78
judgments, one such vector for each set. Each vector was
rearranged by the program into a 12 x 12 document matrix of similarity
judgments. A cell value in the matrix contained the consensus rating of
the similarity of the pair of documents. In the process of forming
the matrix, the first 12 judgments were deleted since these were repeated
later on in the task. Twelve such matrices were formed, one for each
experimental treatment. The matrices were formatted for direct input
to the multidimensional scaling program.
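
The rearrangement of a consensus vector into a document matrix can be
sketched as follows. The function name and the list-of-pairs input are
assumptions, and the diagonal is simply left empty because the report does
not say how self-similarity was coded.

```python
def to_document_matrix(pairs, consensus, n_docs=12, n_warmup=12):
    """Fill a symmetric n_docs x n_docs matrix from paired ratings.

    pairs     -- (i, j) document numbers, 1-based, in presentation order
    consensus -- one consensus rating per entry of `pairs`
    The first n_warmup judgments are dropped, as described in the text.
    """
    matrix = [[None] * n_docs for _ in range(n_docs)]
    for (i, j), rating in zip(pairs[n_warmup:], consensus[n_warmup:]):
        matrix[i - 1][j - 1] = rating
        matrix[j - 1][i - 1] = rating  # similarity is symmetric
    return matrix


# Tiny illustration: 3 documents, 1 warm-up judgment that is later repeated.
small = to_document_matrix([(1, 2), (1, 2), (1, 3), (2, 3)],
                           [9.0, 2.0, 5.0, 7.0], n_docs=3, n_warmup=1)
print(small)
```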

b.  Multidimensional  Scaling  Analysis 
The multidimensional scaling program was recoded from a FORTRAN II
version supplied by the authors (Torgerson and Meuser, op. cit.) into
FORTRAN IV, again first for the 7094 and then for the 360 computer.

Both programs performed extremely well on author-supplied test problems.
However, the 360 version of the multidimensional scaling program, for
unexplained reasons, took three to four times the running time of the
7094 version. This rather unexpected turn of events caused us to
modify the normal procedure in using the multidimensional scaling program.
Usually the program solves for the best-fitting space in successively
lower dimensions, starting with nine and iterating down to one. The
output consists of a matrix of projections of items on each dimension,
rotated to the principal axis position. This is a very lengthy procedure;
the expected run times greatly exceed the time available to us for a
single run. Therefore, a single solution in the highest dimensional
space, nine, was used as the only solution in this experiment. Experience
has indicated that the first 2 or 3 of the principal axis dimensions
extracted change very little in the iterative process, and that, for
12 stimulus objects, the criterion of fit would be reached at about 5
or 6 dimensions, but the criterion of being able to relate dimensions
obtained under different experimental conditions would apply to at most
the first 3 dimensions. The time consideration was even more important
than the not-inconsiderable cost. To follow the complete iterative
procedure would have required at least three months, given the operating
constraints now in existence with our new 360 system. The possible
variation in results is very slight.

Only the first two dimensions extracted showed any positive relation
over all relatable experimental conditions, so further analysis was
restricted to these two dimensions.
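
To make the idea of projections on principal axes concrete, here is a
minimal sketch of classical (metric) scaling, in which a dissimilarity
matrix is double-centered and its leading characteristic vectors are taken
as coordinates. This is a substitute technique chosen for illustration: the
report does not describe the algorithm of the Torgerson and Meuser program,
which iterates over successively lower dimensions, and all names below are
hypothetical.

```python
import random


def double_center(distances):
    """Convert a symmetric distance matrix to a scalar-product matrix."""
    n = len(distances)
    squared = [[d * d for d in row] for row in distances]
    row_means = [sum(row) / n for row in squared]
    grand_mean = sum(row_means) / n
    return [[-0.5 * (squared[i][j] - row_means[i] - row_means[j] + grand_mean)
             for j in range(n)] for i in range(n)]


def dominant_root(matrix, iterations=1000, seed=0):
    """Power iteration for the root of largest magnitude and its unit vector."""
    n = len(matrix)
    rng = random.Random(seed)
    vector = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vector[j] for j in range(n))
                   for i in range(n)]
        norm = sum(x * x for x in product) ** 0.5
        if norm < 1e-12:  # nothing left to extract
            return 0.0, [0.0] * n
        vector = [x / norm for x in product]
    product = [sum(matrix[i][j] * vector[j] for j in range(n))
               for i in range(n)]
    return sum(v * p for v, p in zip(vector, product)), vector


def classical_scaling(distances, dimensions=2):
    """Return one coordinate list per item, largest principal axis first."""
    n = len(distances)
    residual = double_center(distances)
    coordinates = [[] for _ in range(n)]
    for _ in range(dimensions):
        root, vector = dominant_root(residual)
        scale = max(root, 0.0) ** 0.5  # negative roots give no real axis
        for i in range(n):
            coordinates[i].append(scale * vector[i])
        # Deflate so the next pass finds the next axis.
        residual = [[residual[i][j] - root * vector[i] * vector[j]
                     for j in range(n)] for i in range(n)]
    return coordinates


# Three points lying on a line at 0, 3, and 4 (made-up distances).
points = classical_scaling([[0, 3, 4], [3, 0, 1], [4, 1, 0]])
print([[round(c, 3) for c in p] for p in points])
```

For points that lie on a line, the first axis reproduces the distances and
the second axis is empty, which illustrates why only the first few
dimensions carry interpretable information.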

SUMMARY OF RESULTS AND CONCLUSIONS

These experiments were designed to measure the degree of similarity
between judgments made using the different document representations
described in paragraph 2, Purpose and Methodology of the Utility Study.
Since there were five different document representations, the number
of possible paired combinations (five objects taken two at a time) was
equal to ten; that is, ten different comparisons among these five
representations were possible. There were also 3 different document sets,
and in a completely balanced design 30 comparisons could be made,
requiring 15 independent rating experiments. However, not all comparisons
were of equal theoretic interest, and so only 11 and 1 replication (A1')
were selected for detailed study. These 12 judgment sets are listed as
the column and row headings in Figure 37. Note that the rows are divided
to provide for the two dimensions (I and II) derived from the
multidimensional scaling analysis. A total of 17 pairs of comparisons were
made and are recorded in Figure 37. These comparisons indicate the degree
of similarity or congruence between the configurations derived from
judgments of different representations of the same documents.
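
The counts for the completely balanced design can be verified with a brief
enumeration. The variable names are illustrative; the representation codes
and set numbers are those defined in paragraph 2.

```python
from itertools import combinations

representations = ["A", "B", "C", "D", "E"]
document_sets = [1, 2, 3]

# Five representations taken two at a time.
pairs = list(combinations(representations, 2))

# Each pair could be compared within each of the three document sets.
comparisons = [(a + str(s), b + str(s))
               for s in document_sets for a, b in pairs]

# One rating experiment per representation per document set.
judgment_sets = [r + str(s) for s in document_sets for r in representations]

print(len(pairs), len(comparisons), len(judgment_sets))
```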


Several different indexes of congruence have been suggested in the
literature; they all share one common failing: none of them has known
sampling properties. Therefore, no statements of 'statistical
significance' can be made. The index used in this study is one suggested
by Tucker (cited in Harman, p. 257). It is essentially a product-
moment type of index, but it is most definitely not a correlation
coefficient. The formula is:

$$\phi_{pq} = \frac{\sum_{j} {}_{1}a_{jp}\,{}_{2}a_{jq}}{\sqrt{\left(\sum_{j} {}_{1}a_{jp}^{2}\right)\left(\sum_{j} {}_{2}a_{jq}^{2}\right)}}$$

where $[{}_{1}a]$ and $[{}_{2}a]$ are the matrices of projections obtained
under the two conditions being compared, $j$ indexes the documents, and
$p$ and $q$ index the dimensions.

Figure 37. Degree of Similarity between Judgments
of Different Representations of the
Same Documents

As in a correlation, a value of 1.0 indicates perfect agreement and 0.0
indicates no agreement; since the sign of a projection is arbitrary, no
special meaning can be attached to negative indices in the tabulated
result.
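
A congruence index of this product-moment type can be computed for one pair
of dimensions as in the sketch below. The function name is hypothetical, and
the inputs are assumed to be the projections of the same documents, in the
same order, on one dimension from each of the two configurations.

```python
def congruence_index(first, second):
    """Sum of cross products divided by the root of the product of the
    sums of squares; no means are subtracted, so this is not a correlation."""
    numerator = sum(x * y for x, y in zip(first, second))
    denominator = (sum(x * x for x in first)
                   * sum(y * y for y in second)) ** 0.5
    if denominator == 0:
        raise ValueError("projections must not all be zero")
    return numerator / denominator


print(congruence_index([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))   # proportional
print(congruence_index([1.0, 0.0], [0.0, 1.0]))             # unrelated
print(congruence_index([1.0, 2.0], [-1.0, -2.0]))           # reflected axis
```

The third call shows why a negative index has no special meaning here:
reversing the arbitrary sign of one axis changes only the sign of the index.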

In  Figure  37,  the  index  for  the  first  dimension  is  placed  directly  above 
the  index  for  the  second  dimension.  Indices  to  the  left  of  the  double 
vertical  line  are  those  of  greatest  interest.  Only  the  lower  triangular 
matrix  of  indices  is  displayed,  since  the  full  matrix  is  symmetric 
around  the  diagonal. 

Interpretation  of  Results 

The points-of-view analysis, as was noted, produced no surprising results;
therefore, that analysis can be viewed as simply a stage in the data
processing without further comment.

The results of the multidimensional scaling analysis are summarized
in Figure 37. In terms of this experiment, these indices are the best
available summary of the results. Cross plots were made of dimensions
I versus II for each judgment set. These plots were compared visually
in the same combinations as indicated in Figure 37. Only
subjective estimates of congruence are possible by such a comparison,
but these subjective estimates are accurately reflected in the indices
presented (the visual comparisons were made before computing the indices).
In the absence of a known sampling distribution of the congruence index,
certain ad hoc conventions were adopted. These at least follow
accepted practice. An index of below .4 is assumed to indicate at
best a trivial relationship; an index above .9 indicates a good
relationship; and the points in between are interpolated along this
scale.
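
The ad hoc convention can be written out as a small helper. The function
name is hypothetical, and linear interpolation between the two cutoffs is an
assumption, since the report says only that intermediate points are
interpolated. The magnitude of the index is used because the sign of a
projection is arbitrary.

```python
def describe_index(index, trivial_below=0.4, good_above=0.9):
    """Return a label and a position between 0.0 (trivial) and 1.0 (good)."""
    magnitude = abs(index)
    if magnitude < trivial_below:
        return "trivial", 0.0
    if magnitude > good_above:
        return "good", 1.0
    # Assumed linear interpolation between the two cutoffs.
    position = (magnitude - trivial_below) / (good_above - trivial_below)
    return "intermediate", position


for value in (0.933, 0.727, 0.332):
    print(value, describe_index(value))
```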

Certain average indices were computed for convenience in interpreting
Figure 37. Only indices shown to the left of the double vertical line
were used in computing these averages. They are shown in the figure
in parentheses, in the lower left corner of the block (single
horizontal lines) from which they were derived. These indices are
derived from on the order of 600 judgments per set and should be
rather stable.

First, notice that the indices between sets* A1 and A1' are .933 and
.727 (lines 3 and 4). This indicates that, for our sample of subjects,
at least the first two dimensions (based on the information-rich full
text) are reliable and replicable over different groups. Next, in
general, the surrogates do not provide much information about the
second dimension; the highest second dimension index is only .579
(line 18) for the D1 condition. However, that condition is one of
the few that was replicated by comparing it against two full-text
configurations, and the replication index is only .332, indicating that
the degree of relationship is not very high.

*The labeling of these sets is fully described in paragraph 2.

Taken over all, the human-derived index terms contained the most
information about the first dimension, with an average index of
.821 (line 15). The machine terms did only slightly poorer with
an average of .735 (line 11). The human terms might have a slightly
greater edge in providing more second dimension information, an
average of .465 (line 16) versus .229 (line 12). Adding the
classification to the machine terms had an unpredicted effect: the
WD-2 classification apparently depressed performance, reducing the
average index for the first dimension to .548 (line 19), while adding
the WD-3 classification improved things slightly, increasing the index
to .791 (line 23). However, both classifications did add some second
dimension information (lines 20 and 24), which was almost totally
lacking in the machine index terms along line 12.

This result is somewhat hard to explain, since the judges had at least
as much to judge on with the added classification information as with
just the machine index terms. The decrement in performance can possibly
be accounted for by the expressed difficulty of the subjects in
interpreting the classification trees. The fact remains that subjects
did use the classification data; if they had simply ignored the trees,
one would expect no difference between conditions B (machine terms),
D (WD-2), and E (WD-3). However, clearly, WD-2 was worse on Dimension I
than either WD-3 or just the machine terms (lines 19 versus 11 and 23).
WD-2 was perhaps slightly better than just machine terms on Dimension II,
as was WD-3. Further, those judges that used classification data were
relatively consistent among themselves, as can be inferred from the lack
of secondary points of view in the points-of-view analysis. The most
likely interpretation of the result is that the physical layout of the
WD-2 trees led judges to overweight the fine distinctions between
documents, represented by the 'leaf' end of the tree, simply because
there were more of them. The physical form of the WD-3 trees did not
mislead the judges as much; 'leaf' end clusters tend to go 'higher'
in the tree than for WD-2.

To rephrase this, it appears that the WD-2 classification led
the judges to consider the intracluster distances as being more salient
than the intercluster distances. Multidimensional scaling has the property
that, if clusters exist in the data, the analysis tends to disregard
intracluster distances, treating the cluster like a point. Therefore,
the WD-2 classification led the judges to consider a part of the
information that would not be expected to show up in a
multidimensional scaling analysis.

To check on this interpretation, a multidimensional scaling analysis was
performed on the whole 52-document set, using as distances the
node-height measure described earlier. As expected, node heights that were
less than five were mapped into multidimensional scale distances of zero
for both the WD-2 and WD-3 classifications. The clusters of node height
greater than five were found in the cross plots of the first two
obtained dimensions. However, the WD-3 distances were, in general,
greater than for WD-2, so that many more clusters were displayed in the
first two dimensions, i.e., there was more intercluster information.
However, the surmise must remain just that, until more information is
known about human classification performance in general.
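
The node-height check described above can be illustrated with a small
sketch. It assumes, because the measure is defined in an earlier section
and not restated here, that the node-height distance between two documents
is the height of the lowest tree node containing both. The nested-tuple
tree, the document labels, the function names, and the toy cutoff of 2 are
all hypothetical; the report's own cutoff of five applies to the much
taller 52-document trees.

```python
from itertools import combinations


def node_height(tree):
    """Height of a node: 0 for a leaf (a document label), else 1 + tallest child."""
    if not isinstance(tree, tuple):
        return 0
    return 1 + max(node_height(child) for child in tree)


def leaves(tree):
    """All document labels found under a node."""
    if not isinstance(tree, tuple):
        return [tree]
    found = []
    for child in tree:
        found.extend(leaves(child))
    return found


def node_height_distances(tree):
    """Map each unordered document pair to the height of the lowest node joining it."""
    dist = {}
    if not isinstance(tree, tuple):
        return dist
    for child in tree:
        dist.update(node_height_distances(child))
    height = node_height(tree)
    for group_a, group_b in combinations([leaves(c) for c in tree], 2):
        for a in group_a:
            for b in group_b:
                dist[frozenset((a, b))] = height
    return dist


def collapse_small(dist, cutoff):
    """Treat distances below the cutoff as zero, so tight clusters act like points."""
    return {pair: (0 if h < cutoff else h) for pair, h in dist.items()}


# Hypothetical five-document hierarchy.
toy_tree = ((("d1", "d2"), "d3"), ("d4", "d5"))
distances = node_height_distances(toy_tree)
collapsed = collapse_small(distances, cutoff=2)
```

In this toy tree, d1 and d2 join at height 1 and so collapse to distance
zero, while d1 and d4 only join at the root (height 3) and stay apart. This
mirrors the observation that low node heights vanish in the scaling
analysis while taller clusters remain visible.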

WD-3  also  fares  better  than  WD-2  in  other  ways.  Document  set  3  was 
derived  from  WD-3  clusters  of  the  whole  52-document  collection.  A 
general  comparison  shows  that  almost  every  index  involving  document  set 
3  was  higher  than  other  comparable  conditions.  Note  the  rows  and 
columns labeled A3, B3, C3, and E3.

An analysis of variance would be inappropriate for these data; however,
it is clear by inspection that there is a consistent 'Document Set
Effect' in favor of Set 3. This is explainable if the WD-3 clusters are
'more distinctive' than the others, and, hence, documents and surrogates
selected from WD-3 clusters are easier to distinguish.

Finally, the WD-3 classification data are substantially the same as
the human index terms on both Dimensions I and II. The average index
over both dimensions is .613 for WD-3 (average of lines 23 and 24)
versus .643 for the human terms (average of lines 15 and 16), and the
first-dimension average indices are .791 (line 23) versus .821
(line 15). Based upon these data, it seems reasonable to assume that, as
judges become more experienced in interpreting tree diagrams, the use of
this form of automated classification would provide more information
relative to the full text than would subject heading index terms alone.


At least, the relatively inexpensive indexing and classification
represented by WD-3 is very nearly as informative to our class of judges
as the much more costly human-derived index terms of condition C.
Further, the machine terms alone (condition B) are fairly good relative
to these same human terms.
SECTION VII

INTERPRETATIONS  AND  RECOMMENDATIONS 

In preceding years, System Development Corporation, under contract with
Rome Air Development Center, has developed a set of computer programs
for the automated classification of documents. These programs, called
ALCAPP (Automatic List Classification and Profile Production), were
coded for use with the AN/FSQ-32 computer. In contrast with most other
automatic document classification procedures, the ALCAPP programs are
designed to be economical when used with large data bases, for computer
time increases as a direct function of the number of items to be
classified. The programming system consists of three parts: the data
base generator, the hierarchical-grouping program, and the iterative
cluster-finding program.

The current research project has as its purposes:

(1) To recode all three programs for use with RADC's GE 635 computer;

(2) To investigate the statistical reliabilities of the
hierarchical-grouping program under a variety of conditions;

(3) To investigate the statistical reliabilities of the iterative
cluster-finding program;

(4) To investigate the utility of the machine-produced classification
hierarchies for predicting document content.

The preceding sections of this report describe in detail the experiments
that were performed and the results that were obtained. In this
concluding section, we review and interpret the results and state
our conclusions and recommendations.

1.  RECODING  THE  PROGRAMS 

All of the programs were rewritten in JOVIAL for compilation on the
GE  635  computer.  A  detailed  description  of  the  programs  and  their 
flow  charts  is  available  in  the  Appendix  to  this  report. 

2. HIERARCHICAL-CLUSTERING PROGRAMS

The  aim  of  this  set  of  experiments  is  to  compare  the  classification 
structures  that  are  the  result,  or  output,  of  the  hierarchical-clustering 
program  when  various  input  conditions  are  manipulated.  The  conditions 
that  were  systematically  varied  and  tested  are: 

(1)  The  classification  algorithm; 

(2)  The  type  of  indexing; 

(3) The depth of indexing;

(4)  The  order  in  which  the  documents  are  processed. 

The  following  paragraphs  report  the  results  in  more  detail,  but  in  essence 
it  can  be  stated  that  the  output  of  the  hierarchical-clustering  program 
is  sensitive  to  variations  in  the  first  three  variables  and  relatively 
insensitive to order effects.
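
One way to make "comparing classification structures" concrete is
sketched below. It is an illustration only: the statistic actually used in
the experiments is described in the earlier sections and is not restated
here. The sketch assumes each structure has already been reduced to a table
of pairwise document distances (for example, node heights) and measures
agreement with a Pearson correlation. The function names and the toy
distance tables are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant distances")
    return sxy / sqrt(sxx * syy)


def structure_similarity(dist_a, dist_b):
    """Correlate the distances two structures assign to the document pairs they share."""
    shared_pairs = sorted(set(dist_a) & set(dist_b))
    xs = [dist_a[pair] for pair in shared_pairs]
    ys = [dist_b[pair] for pair in shared_pairs]
    return pearson(xs, ys)


# Hypothetical distances from two runs over the same three documents.
run_1 = {("d1", "d2"): 1, ("d1", "d3"): 2, ("d2", "d3"): 2}
run_2 = {("d1", "d2"): 2, ("d1", "d3"): 2, ("d2", "d3"): 1}

same = structure_similarity(run_1, run_1)
different = structure_similarity(run_1, run_2)
```

A value near 1 indicates that two input conditions produced essentially
the same hierarchy; lower values indicate that the condition being varied
(algorithm, type or depth of indexing, or processing order) changed the
structure.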

a. Interpreting the Effect of the Classification Algorithm
Two different classification procedures were compared, and it was determined
that differences in the computer program result in document clusters that
are only moderately similar. While some clustering differences were
expected, the classification structures were less similar to each other
than anticipated.

These  findings  statistically  support  the  view  that,  although  all 
automatic  classification  techniques  cluster  documents  on  the  basis  of 
word  similarity,  the  resulting  classification  structures  may  differ 
significantly  from  each  other,  depending  upon  the  logic  of  the  classification 
algorithm.  Just  as  manual  classification  schemes  differ  from  each  other, 
so  do  mathematically  derived  classification  systems.  They  are  not  the 
same,  and  the  utility  of  each  system  must  be  evaluated  separately. 

As an outcome of this experiment, it can be stated that the WD-3 algorithm
appears to be slightly more stable under a variety of input conditions than
is the WD-2 algorithm.


Now that the structure and the statistical properties of both algorithms
are known, it would seem advantageous to study methods of combining both
logics, and indeed other logics as well, to develop a classification
algorithm that would be more satisfactory than either one separately.

b. Interpreting the Effect of the Type of Indexing
The documents used in the hierarchical classification program were indexed
by skilled librarians and by machine-aided techniques. The librarians
prepared index lists of multiple-word concept terms, while the computer
derived individual key-word index terms. The classification structures
based upon both types of indexing were compared for similarity.
The results of this experiment are very significant, for they show that
the automatically derived classification structures, based upon the same
set of documents, will differ significantly, depending on whether the
type of indexing used is key word or concept. The experiment demonstrates
the need for a consistent vocabulary in classifying documents.

Although  these  findings  are  based  upon  machine-derived  classification 
systems,  the  statistical  significance  of  the  results  cautions  against 
mixing  concept  and  key-word  indexing  in  any  document  storage  and  retrieval 
system. 


c.  Interpreting  the  Effect  of  the  Depth  of  Indexing 
A series of experiments was designed to investigate whether differences
in indexing depth would result in differences in the classification
structure. Lists of 6, 15, and 30 index terms were derived from the
same documents, and these lists were processed separately into classification
structures, which were then compared for similarity.
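
The effect of indexing depth on the input to the classifier can be
sketched as follows. This is a hedged illustration, not the ALCAPP
procedure: it assumes each document has a ranked list of index terms, cuts
the list at the three depths named above, and uses a simple Jaccard overlap
as a stand-in for whatever word-similarity measure the programs apply. The
function names and the synthetic term lists are hypothetical.

```python
def truncate_index(ranked_terms, depth):
    """Keep only the first `depth` terms of a ranked index list."""
    return ranked_terms[:depth]


def jaccard(terms_a, terms_b):
    """Share of distinct terms that two index lists have in common."""
    set_a, set_b = set(terms_a), set(terms_b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def similarity_by_depth(doc_x, doc_y, depths=(6, 15, 30)):
    """Document-to-document similarity at each indexing depth."""
    return {
        depth: jaccard(truncate_index(doc_x, depth), truncate_index(doc_y, depth))
        for depth in depths
    }


# Two synthetic documents that share only their three top-ranked terms.
doc_x = [f"x{i}" for i in range(30)]
doc_y = doc_x[:3] + [f"y{i}" for i in range(27)]

sims = similarity_by_depth(doc_x, doc_y)
```

Because the similarity between the same two documents changes with list
length, any grouping built from those similarities can change as well,
which is the dependence the experiments set out to measure.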

It is concluded from the results that the number of index terms on the
lists being processed can significantly affect the arrangement and
clustering of items in an automatically derived classification hierarchy.
Furthermore, this relationship holds true for both the WD-2 and WD-3
algorithms and for both key-word indexing and concept indexing. There is,
however, an interesting difference based upon the type of indexing.
The longer the list of key words, the more stable is the classification
structure. For human indexing, this trend is reversed, and the
classification structure is most stable when derived from lists containing
relatively few multiple-word index terms.


A reasonable interpretation that can account for these results is that
a fairly large number of key words are needed to make the index list
an adequate (and thus stable) representation of the document. This is
not true when using concept indexing. A relatively small number of
concepts can adequately describe the subject matter of a document.
If a larger number is used, extraneous concepts are included, and
classification structures derived from these longer lists are subject
to chance fluctuations and are thus less reliable.


These experimental results are consistent with the previous findings
that concept and key-word indexing should not be mixed. These findings
are also significant in themselves, for they indicate that, certainly for
machine-derived classification structures and probably in general, there
is an optimal number of index terms that makes for the most stable
document surrogate, and this number differs, depending on whether
concept indexing or key-word indexing is used.
d. Interpreting the Effects of the Order in which Documents Are
Classified

In  building  a  classification  system,  the  original  documents  to  be 
classified  tend  to  exert  a  greater  influence  on  the  resulting  structure 
than  do  later  documents,  which  are  then  simply  fitted  into  the  existing 
structure.  This  statement  seems  to  hold  true  for  all  classification  systems, 
be  they  manually  or  machine  derived.  However,  because  of  the  manner  in 
which  documents  are  paired,  the  effect  may  be  even  greater  in  automatic 
classification  procedures. 

To  investigate  the  effect  that  the  order  of  document  input  may  have, 
three  different  arrangements  were  used  and  the  resulting  classification 
structures  compared.  The  results  indicate  that,  while  there  is  some 
variation  in  the  final  classification  structure,  processing  order  is 
not  a  very  significant  factor.  It  is  also  evident  that  the  WD-3 
algorithm  is  less  sensitive  to  this  variable  than  is  WD-2.  The  use  of 
concept  indexing  will  tend  to  further  increase  the  reliability  of  the 
classification  structure. 

3.  ITERATIVE  CLUSTER-FINDING  PROGRAMS 

In  using  these  programs  it  is  necessary  to  decide  first  on  a  reasonable 
number  of  categories  and  to  arbitrarily  assign  a  few  documents  to  each  of 
these initial categories. Then, by a series of iterations, the program will
divide  the  entire  document  collection  into  groups. 
Three questions were asked about the operation of this program:

(1)  How  dependent  is  the  final  cluster  arrangement  on  the  initial 
arbitrary  assignment  of  documents? 

(2)  How  stable  are  the  clusters  as  new  documents  are  added  to  the 
collection? 

(3)  How  homogeneous  and  reasonable  are  the  clusters? 

The experimental results on which the answers to these questions are
based are described in detail in Section V of this report. The overall
conclusions and recommendations are that the iterative cluster-finding
program is sensitive to changes in the initial cluster configuration, and
that, therefore, in a practical application the initial documents should
be selectively rather than arbitrarily assigned. By seeding the clusters
with selected documents, we will obtain final categories that probably
are more reasonable and homogeneous. The classification categories are
stable, and additional documents can be added to the collection without
causing any major shifts, provided that these new documents do not
constitute more than ten percent of the original collection.
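
The seeding recommendation and the ten-percent guideline can be sketched
together. The code below is a generic seeded reassignment loop written for
illustration, not the ALCAPP cluster-finding program: it assumes each
document is a set of index terms, that a cluster profile is the term count
over its current members, and that a document goes to the profile it
overlaps most. All names and the sample documents are hypothetical.

```python
from collections import Counter


def build_profiles(docs, assignment, n_clusters):
    """Term-count profile for each cluster, built from its current members."""
    profiles = [Counter() for _ in range(n_clusters)]
    for doc_id, cluster in assignment.items():
        profiles[cluster].update(docs[doc_id])
    return profiles


def best_cluster(terms, profiles):
    """Cluster whose profile overlaps the terms most (ties go to the lowest number)."""
    scores = [sum(profile[t] for t in terms) for profile in profiles]
    return scores.index(max(scores))


def iterate_clusters(docs, seeds, max_iter=20):
    """docs: id -> set of terms; seeds: id -> starting cluster for a few documents."""
    n_clusters = max(seeds.values()) + 1
    assignment = dict(seeds)
    for _ in range(max_iter):
        profiles = build_profiles(docs, assignment, n_clusters)
        updated = {doc_id: best_cluster(terms, profiles)
                   for doc_id, terms in docs.items()}
        if updated == assignment:  # no document moved, so the clusters are stable
            break
        assignment = updated
    return assignment


def within_growth_limit(original_count, added_count, limit_percent=10):
    """True when new documents are at most limit_percent of the original collection."""
    return added_count * 100 <= limit_percent * original_count


docs = {
    "a": {"radar", "antenna"},
    "b": {"radar", "signal"},
    "c": {"library", "index"},
    "d": {"index", "catalog"},
    "e": {"antenna", "signal"},
}
# Selected (not arbitrary) seeds: one clearly typical document per category.
final = iterate_clusters(docs, seeds={"a": 0, "c": 1})
```

With these seeds the loop settles after two passes, grouping the three
radar documents apart from the two library documents. Changing the seeds
changes the starting profiles, which is why the final arrangement depends
on the initial assignment. `within_growth_limit(600, 60)` is true, while
adding 61 documents to a 600-document collection would exceed the
guideline and suggest rerunning the classification.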

Finally,  it  is  our  conclusion  that  the  automatically  derived  classification 
structure  of  a  document  collection  constitutes  a  good  initial  organization 
of the material, but that this organization can be improved and made more
meaningful  if  it  is  reviewed  and  modified  by  trained  personnel. 
4. THE USE OF MACHINE-PRODUCED CLASSIFICATION HIERARCHIES FOR PREDICTING
DOCUMENT CONTENT

This set of experiments was aimed at determining whether knowing the
classification category in which a document has been placed will provide
additional useful information for judging the content of that document and
therefore its possible relevance to a need for information. Essentially,
the experimental design was based upon making judgments on how similar
various document representations were as compared with the full document.

We  were  particularly  interested  in  knowing  whether  a  document  representation 
consisting  of  index  terms  and  classification  data  was  superior  to  a  document 
representation  of  index  terms  alone. 

A  detailed  description  of  these  experiments  and  the  results  are  available 
in  Section  VI. 

In retrospect, it seems clear that the subjects needed more instruction
and experience in using classification trees. Nevertheless, although
they had difficulty in interpreting these trees, they did use the
classification data in making their judgments. On the major variables
most commonly used in judging document relevance, a knowledge of
classification categories did not add much to the obtained scores.
However, classification provided other information, as shown by the
increased scores for the second dimension, and thus could improve the
overall judgment of relevance.
5. FINAL RECOMMENDATIONS

The statistical properties of the Automatic List Classification and
Profile Production programs have been clarified, and new knowledge has been
gained about the strengths and weaknesses of these programs. To conclude
that automated document classification is not perfect would be to make
a true statement but one not based upon, or directly derived from, the
results of the preceding experiments. It is a truism, as would be the
statement that no existing library classification system is perfect,
and it is just as meaningless.

Classification  is  a  method  of  file  organization,  and  it  is  needed  in 
both  traditional  libraries  and  in  automated  document  storage  and  retrieval 
systems. 

Libraries employ skilled personnel to analyze the subject content of a
document and to assign it a proper place in a logically organized
structure. This is a time-consuming and expensive task, but it works
reasonably well. However, in an automated system where every effort is
being made to reduce search time and provide faster customer service,
manual indexing and classification would be anachronistic. Why improve
search time by microseconds when it takes weeks to put new documents
into the file? Mechanization of the input procedure (the initial processing
of the document and the organization of the file) is a necessity.
Many researchers have been working to achieve this goal. As is usually
the case, in the beginning, great advances, even breakthroughs, are made.
But  the  consolidation  of  these  gains  and  their  application  to  practical 
systems  is  a  long  and  painstaking  task.  The  research  reported  in  this 
study  is  a  small  but  necessary  step  in  making  automated  document 
classification  a  practical  reality. 